CONFIDENCE SETS FOR SPLIT POINTS IN DECISION TREES

By Moulinath Banerjee and Ian W. McKeague

University of Michigan and Columbia University 

We investigate the problem of finding confidence sets for split 
points in decision trees (CART). Our main results establish the asymp- 
totic distribution of the least squares estimators and some associated 
residual sum of squares statistics in a binary decision tree approx- 
imation to a smooth regression curve. Cube-root asymptotics with 
nonnormal limit distributions are involved. We study various confi- 
dence sets for the split point, one calibrated using the subsampling 
bootstrap, and others calibrated using plug-in estimates of some nui- 
sance parameters. The performance of the confidence sets is assessed 
in a simulation study. A motivation for developing such confidence 
sets comes from the problem of phosphorus pollution in the Ever- 
glades. Ecologists have suggested that split points provide a phos- 
phorus threshold at which biological imbalance occurs, and the lower 
endpoint of the confidence set may be interpreted as a level that 
is protective of the ecosystem. This is illustrated using data from 
a Duke University Wetlands Center phosphorus dosing study in the 
Everglades. 

1. Introduction. It has been over twenty years since decision trees (CART) 
came into widespread use for obtaining simple predictive rules for the
classification of complex data. For each predictor variable $X$ in a (binary)
regression tree analysis, the predicted response splits according to whether
$X \le d$ or $X > d$, for some split point $d$. Although the rationale behind CART is
primarily statistical, the split point can be important in its own right, and 
in some applications it represents a parameter of real scientific interest. For 
example, split points have been interpreted as thresholds for the presence of 
environmental damage in the development of pollution control standards. In
a recent study [16] of the effects of phosphorus pollution in the Everglades,
split points are used in a novel way to identify threshold levels of phosphorus
concentration that are associated with declines in the abundance of certain 
species. The present paper introduces and studies various approaches to 
finding confidence sets for such split points. 

The split point represents the best approximation of a binary decision
tree (piecewise constant function with a single jump) to the regression curve
$E(Y|X = x)$, where $Y$ is the response. Bühlmann and Yu [3] recently studied
the asymptotics of split-point estimation in a homoscedastic nonparametric
regression framework, and showed that the least squares estimator $\hat d_n$ of
the split point $d$ converges at a cube-root rate, a result that is important
in the context of analyzing bagging. As we are interested in confidence
intervals, however, we need the exact form of the limiting distribution, and
we are not able to use their result due to an implicit assumption that the
"lower" least squares estimator $\hat\beta_l$ of the optimal level to the left of
the split point converges at $\sqrt{n}$-rate (similarly for the "upper" least
squares estimator $\hat\beta_u$). Indeed, we find that $\hat\beta_l$ and $\hat\beta_u$
converge at cube-root rate, which naturally affects the asymptotic distribution
of $\hat d_n$, although not its rate of convergence.

In the present paper we find the joint asymptotic distribution of $(\hat d_n, \hat\beta_l, \hat\beta_u)$
and some related residual sum of squares (RSS) statistics. Homoscedasticity 
of errors is not required, although we do require some mild conditions on the 
conditional variance function. In addition, we show that our approach read- 
ily applies in the setting of generalized nonparametric regression, including 
nonlinear logistic and Poisson regression. Our results are used to construct 
various types of confidence intervals for split points. Plug-in estimates for 
nuisance parameters in the limiting distribution (which include the deriva- 
tive of the regression function at the split point) are needed to implement 
some of the procedures. We also study a type of bootstrap confidence inter- 
val, which has the attractive feature that estimation of nuisance parameters 
is eliminated, albeit at a high computational cost. Efron's bootstrap fails for 
$\hat d_n$ (as pointed out by Bühlmann and Yu [3], page 940), but the subsampling
bootstrap of Politis and Romano [14] still works. We carry out a simulation 
study to compare the performance of the various procedures. 

We also show that the working model of a piecewise constant function 
with a single jump can be naturally extended to allow a smooth parametric 
curve to the left of the jump and a smooth parametric curve to the right of 
the jump. A model of this type is a two-phase linear regression (also called 
break-point regression), as has been found useful, for example, in change- 
point analysis for climate data [12] and the estimation of mixed layer depth 
from oceanic profile data [20]. Similar models are used in econometrics, 
where they are called structural change models and threshold regression 
models.

In change-point analysis the aim is to estimate the locations of jump
discontinuities in an otherwise smooth curve. Methods to do this are well
developed in the nonparametric regression literature; see, for example, [1, 5, 9].
No distinction is made between the working model that has the jump point 
and the model that is assumed to generate the data. In contrast, confidence 
intervals for split points are model-robust in the sense that they apply un- 
der misspecification of the discontinuous working model by a smooth curve. 
Split-point analysis can thus be seen as complementary to change-point anal- 
ysis: it is more appropriate in applications (such as the Everglades example 
mentioned above) in which the regression function is thought to be smooth, 
and does not require the a priori existence of a jump discontinuity. The work- 
ing model has the jump discontinuity and is simply designed to condense key 
information about the underlying curve to a small number of parameters. 

Confidence intervals for change-points are highly unstable under model 
misspecification by a smooth curve due to a sharp decrease in estimator rate 
of convergence: from close to n under the assumed change-point model, to 
only a cube-root rate under a smooth curve (as for split-point estimators). 
This is not surprising because the split point depends on local features of 
a smooth regression curve which are harder to estimate than jumps. Mis- 
specification of a change-point model thus causes confidence intervals to be 
misleadingly narrow, and rules out applications in which the existence of an 
abrupt change cannot be assumed a priori. In contrast, misspecification of a 
continuous (parametric) regression model (e.g., linear regression) causes no 
change in the $\sqrt{n}$-rate of convergence and the model-robust (Huber-White)
sandwich estimate of variance is available. While the statistical literature on 
change-point analysis and model-robust estimation is comprehensive, split- 
point estimation falls in the gap between these two topics and is in need of 
further development. 

The paper is organized as follows. In Section 2 we develop our main results 
and indicate how they can be applied in generalized nonparametric regres- 
sion settings. In Section 3 we discuss an extension of our procedures to 
decision trees that incorporate general parametric working models. Simula- 
tion results and an application to Everglades data are presented in Section 4. 
Proofs are collected in Section 5. 

2. Split-point estimation in nonparametric regression. We start this sec- 
tion by studying the problem of estimating the split point in a binary deci- 
sion tree for nonparametric regression. 

Let $X$, $Y$ denote the (one-dimensional) predictor and response variables,
respectively, and assume that $Y$ has a finite second moment. The nonparametric
regression function $f(x) = E(Y|X = x)$ is to be approximated using a
decision tree with a single (terminal) node, that is, a piecewise constant
function with a single jump. The predictor $X$ is assumed to have a density $p_X$,
and its distribution function is denoted $F_X$. For convenience, we adopt the
usual representation $Y = f(X) + \varepsilon$, with the error $\varepsilon = Y - E(Y|X)$
having zero conditional mean given $X$. The conditional variance of $\varepsilon$
given $X = x$ is denoted $\sigma^2(x)$.

Suppose we have $n$ i.i.d. observations $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ of
$(X, Y)$. Consider the working model in which $f$ is treated as a stump, that
is, a piecewise constant function with a single jump, having parameters
$(\beta_l, \beta_u, d)$, where $d$ is the point at which the function jumps, $\beta_l$ is
the value to the left of the jump and $\beta_u$ is the value to the right of the
jump. Best projected values are then defined by

$$(\beta_l^0, \beta_u^0, d^0) = \arg\min_{\beta_l, \beta_u, d} E[Y - \beta_l 1(X \le d) - \beta_u 1(X > d)]^2. \tag{2.1}$$

Before proceeding, we impose some mild conditions.

Conditions. 

(A1) There is a unique minimizer $(\beta_l^0, \beta_u^0, d^0)$ of the expectation on the
right-hand side of (2.1) with $\beta_l^0 \ne \beta_u^0$.

(A2) $f(x)$ is continuous and is continuously differentiable in an open
neighborhood $N$ of $d^0$. Also, $f'(d^0) \ne 0$.

(A3) $p_X(x)$ does not vanish and is continuously differentiable on $N$.

(A4) $\sigma^2(x)$ is continuous on $N$.

(A5) $\sup_{x \in N} E[\varepsilon^2 1\{|\varepsilon| > \eta\} \mid X = x] \to 0$ as $\eta \to \infty$.

The vector $(\beta_l^0, \beta_u^0, d^0)$ then satisfies the normal equations

$$\beta_l^0 = E(Y \mid X \le d^0), \qquad \beta_u^0 = E(Y \mid X > d^0), \qquad f(d^0) = \frac{\beta_l^0 + \beta_u^0}{2}.$$

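The usual estimates of these quantities are obtained via least squares as

$$(\hat\beta_l, \hat\beta_u, \hat d_n) = \arg\min_{\beta_l, \beta_u, d} \sum_{i=1}^n [Y_i - \beta_l 1(X_i \le d) - \beta_u 1(X_i > d)]^2. \tag{2.2}$$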

Here and in the sequel, whenever we refer to a minimizer, we mean some 
choice of minimizer rather than the set of all minimizers (similarly for maxi- 
mizers). Our first result gives the joint asymptotic distribution of these least 
squares estimators. 
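
The least squares fit in (2.2) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `fit_stump` is hypothetical, and the convention of reporting the left order statistic as the split point is an assumption (any value between two adjacent order statistics gives the same fit).

```python
from typing import Sequence, Tuple


def fit_stump(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (beta_l, beta_u, d) minimizing
    sum_i (y_i - beta_l * 1(x_i <= d) - beta_u * 1(x_i > d))^2.

    For a fixed split the optimal levels are the two group means, so
    minimizing the RSS is the same as maximizing
    (sum of left y's)^2 / k + (sum of right y's)^2 / (n - k).
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(v for _, v in pairs)
    best = None
    left_sum = 0.0
    for k in range(1, n):              # k observations fall to the left
        left_sum += pairs[k - 1][1]
        if pairs[k - 1][0] == pairs[k][0]:
            continue                   # cannot split between tied x values
        right_sum = total - left_sum
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or gain > best[0]:
            # Assumed convention: report the largest x on the left side.
            best = (gain, left_sum / k, right_sum / (n - k), pairs[k - 1][0])
    if best is None:
        raise ValueError("all x values are identical")
    return best[1], best[2], best[3]


beta_l, beta_u, d_hat = fit_stump([0.1, 0.2, 0.3, 0.6, 0.7, 0.8],
                                  [0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
```

After sorting, the scan over candidate splits is linear in $n$ because the left and right sums are updated incrementally.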

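Theorem 2.1. If (A1)-(A5) hold, then

$$n^{1/3}(\hat\beta_l - \beta_l^0,\ \hat\beta_u - \beta_u^0,\ \hat d_n - d^0) \to_d (c_1, c_2, 1) \arg\max_t Q(t),$$

where

$$Q(t) = aW(t) - bt^2,$$

$W$ is a standard two-sided Brownian motion process on the real line,
$a^2 = \sigma^2(d^0) p_X(d^0)$,

$$b = b_0 - \frac{1}{8} |\beta_l^0 - \beta_u^0|\, p_X(d^0)^2 \left(\frac{1}{F_X(d^0)} + \frac{1}{1 - F_X(d^0)}\right) > 0,$$

with $b_0 = |f'(d^0)| p_X(d^0)/2$ and

$$c_1 = \frac{p_X(d^0)(\beta_u^0 - \beta_l^0)}{2F_X(d^0)}, \qquad c_2 = \frac{p_X(d^0)(\beta_u^0 - \beta_l^0)}{2(1 - F_X(d^0))}.$$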

In our notation, Bühlmann and Yu's [3] Theorem 3.1 states that
$n^{1/3}(\hat d_n - d^0) \to_d \arg\max_t Q_0(t)$, where $Q_0(t) = aW(t) - b_0 t^2$. The first
step in their proof assumes that it suffices to study the case in which
$(\beta_l^0, \beta_u^0)$ is known. To justify this, they claim that $(\hat\beta_l, \hat\beta_u)$
converges at $\sqrt{n}$-rate to the population projected values $(\beta_l^0, \beta_u^0)$,
which is faster than the $n^{1/3}$-rate of convergence of $\hat d_n$ to $d^0$. However,
Theorem 2.1 shows that this is not the case; all three parameter estimates
converge at cube-root rate, and have a nondegenerate joint asymptotic
distribution concentrated on a line through the origin. Moreover, the limiting
distribution of $\hat d_n$ differs from the one stated by Bühlmann and Yu because
$b \ne b_0$; their limiting distribution will appear later in connection with (2.8).

Wald-type confidence intervals. It can be shown using Brownian scaling
(see, e.g., [2]) that

$$Q(t) \stackrel{d}{=} a(a/b)^{1/3} Q_1((b/a)^{2/3} t), \tag{2.3}$$

where $Q_1(t) = W(t) - t^2$, so the limit in the above theorem can be expressed
more simply as

$$(c_1, c_2, 1)\,(a/b)^{2/3} \arg\max_t Q_1(t).$$

Let $p_{\alpha/2}$ denote the upper $\alpha/2$-quantile of the distribution of
$\arg\max_t Q_1(t)$ (this is symmetric about 0), known as Chernoff's distribution.
Accurate values of $p_{\alpha/2}$, for selected values of $\alpha$, are available in [10],
where numerical aspects of Chernoff's distribution are studied. Utilizing the
above theorem, this allows us to construct approximate $100(1 - \alpha)\%$ confidence
limits simultaneously for all the parameters $(\beta_l^0, \beta_u^0, d^0)$ in the working
model:

$$\hat\beta_l \pm \hat c_1 \hat\delta_n, \qquad \hat\beta_u \pm \hat c_2 \hat\delta_n, \qquad \hat d_n \pm \hat\delta_n, \tag{2.4}$$

where $\hat\delta_n = n^{-1/3} (\hat a/\hat b)^{2/3} p_{\alpha/2}$, given consistent estimators
$\hat c_1, \hat c_2, \hat a, \hat b$ of the nuisance parameters. The density and
distribution function of $X$ at $d^0$ can be estimated without difficulty,
since an i.i.d. sample from the distribution of $X$ is available. The derivative
$f'(d^0)$ and the conditional variance $\sigma^2(d^0)$ are harder to estimate, but many
methods to do this are available in the literature, for example, local
polynomial fitting with data-driven local bandwidth selection [18].

These confidence intervals are centered on the point estimate and have
the disadvantage of not adapting to any skewness in the sampling distribution,
which might be a problem in small samples. A more serious problem,
however, is that the width of the interval is proportional to $(\hat a/\hat b)^{2/3}$,
which blows up if $\hat b$ is small relative to $\hat a$. It follows from Theorem 2.1
that in the presence of conditions (A2)-(A5), the uniqueness condition (A1)
fails if $b < 0$. Moreover, $b < 0$ if the gradient of the regression function is
less than the jump in the working model multiplied by the density of $X$ at
the split point, $|f'(d^0)| < p_X(d^0)|\beta_u^0 - \beta_l^0|$. This suggests that the
Wald-type confidence interval becomes unstable if the regression function is
flat enough at the split point.
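
The Wald-type limits can be computed as follows once the nuisance estimates are in hand. This is a sketch under stated assumptions: the half-width is taken to be $n^{-1/3}(\hat a/\hat b)^{2/3} p_{\alpha/2}$ as implied by Theorem 2.1 and Brownian scaling, the function name `wald_limits` is hypothetical, and the value 0.998181 used for $p_{0.025}$ is a commonly tabulated quantile of Chernoff's distribution that should be checked against [10]. The inputs in the example are illustrative numbers, not estimates from data.

```python
from typing import Dict, Tuple


def wald_limits(beta_l: float, beta_u: float, d: float,
                c1: float, c2: float, a: float, b: float,
                n: int, p_half_alpha: float) -> Dict[str, Tuple[float, float]]:
    """Simultaneous Wald-type limits for (beta_l, beta_u, d).

    Half-width for d: delta_n = n^(-1/3) * (a/b)^(2/3) * p_{alpha/2};
    the limits for beta_l and beta_u rescale delta_n by |c1| and |c2|.
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive; the interval is undefined otherwise")
    delta = n ** (-1.0 / 3.0) * (a / b) ** (2.0 / 3.0) * p_half_alpha
    return {
        "beta_l": (beta_l - abs(c1) * delta, beta_l + abs(c1) * delta),
        "beta_u": (beta_u - abs(c2) * delta, beta_u + abs(c2) * delta),
        "d": (d - delta, d + delta),
    }


# Illustrative inputs (assumed values, not estimates from real data).
P_CHERNOFF_0025 = 0.998181  # assumed tabulated upper 2.5% point; verify in [10]
limits = wald_limits(beta_l=0.092, beta_u=0.908, d=0.5,
                     c1=0.816, c2=0.816, a=0.5, b=1.467,
                     n=1000, p_half_alpha=P_CHERNOFF_0025)
d_low, d_high = limits["d"]
```

Because the half-width depends on $b$ only through $(a/b)^{2/3}$, a small or poorly estimated $\hat b$ inflates all three intervals at once, which is the instability described above.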

Subsampling. Theorem 2.1 also makes it possible to avoid the estimation
of nuisance parameters by using the subsampling bootstrap, which involves
drawing a large number of subsamples of size $m = m_n$ from the original
sample of size $n$ (without replacement). Then we can estimate the limiting
quantiles of $n^{1/3}(\hat d_n - d^0)$ using the empirical distribution of
$m^{1/3}(\hat d_m^* - \hat d_n)$; here $\hat d_m^*$ is the value of the split point of the
best fitting stump based on the subsample. For consistent estimation of the
quantiles, we need $m/n \to 0$. In the literature $m$ is referred to as the
block-size; see [15]. The choice of $m$ has a strong effect on the precision of
the confidence interval, so a data-driven choice of $m$ is recommended in
practice; Delgado, Rodriguez-Poo and Wolf [4] suggest a bootstrap-based
algorithm for this purpose.
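
The subsampling recipe can be sketched as below. The helper names (`split_point`, `subsampling_ci`), the number of subsamples and the simple order-statistic quantile rule are assumptions made for illustration; the block size `m` is supplied by the caller, and no data-driven choice of `m` is attempted.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_point(x: Sequence[float], y: Sequence[float]) -> float:
    """Least squares split point of a stump (left order statistic convention)."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    best_gain, best_d, left = float("-inf"), pairs[0][0], 0.0
    for k in range(1, n):
        left += pairs[k - 1][1]
        if pairs[k - 1][0] == pairs[k][0]:
            continue
        gain = left ** 2 / k + (total - left) ** 2 / (n - k)
        if gain > best_gain:
            best_gain, best_d = gain, pairs[k - 1][0]
    return best_d


def subsampling_ci(x: Sequence[float], y: Sequence[float], m: int,
                   n_sub: int = 500, alpha: float = 0.05,
                   seed: int = 0) -> Tuple[float, float]:
    """Approximate 100(1 - alpha)% interval for the split point.

    The quantiles of n^(1/3) (d_hat - d0) are estimated by those of
    m^(1/3) (d_star - d_hat) over subsamples drawn without replacement.
    """
    n = len(x)
    if not 2 <= m < n:
        raise ValueError("block size m must satisfy 2 <= m < n")
    rng = random.Random(seed)
    d_hat = split_point(x, y)
    roots: List[float] = []
    for _ in range(n_sub):
        idx = rng.sample(range(n), m)            # without replacement
        d_star = split_point([x[i] for i in idx], [y[i] for i in idx])
        roots.append(m ** (1.0 / 3.0) * (d_star - d_hat))
    roots.sort()
    q_lo = roots[math.floor((alpha / 2) * (n_sub - 1))]
    q_hi = roots[math.ceil((1 - alpha / 2) * (n_sub - 1))]
    scale = n ** (-1.0 / 3.0)
    # Invert q_lo <= n^(1/3) (d_hat - d0) <= q_hi for d0.
    return d_hat - scale * q_hi, d_hat - scale * q_lo
```

Note that the upper quantile of the root determines the lower confidence limit and vice versa, so the interval need not be symmetric about $\hat d_n$.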

Confidence sets based on residual sums of squares. Another strategy is
to use the quadratic loss function as an asymptotic pivot, which can be
inverted to provide a confidence set. Such an approach was originally
suggested by Stein [19] for a multivariate normal mean and has recently been
used by Genovese and Wasserman [8] for nonparametric wavelet regression.
To motivate the approach in the present setting, consider testing the null
hypothesis that the working model parameters take the values $(\beta_l, \beta_u, d)$.
Under the working model with a constant error variance, the likelihood-ratio
statistic for testing this null hypothesis is given by

$$\mathrm{RSS}_0(\beta_l, \beta_u, d) = \sum_{i=1}^n (Y_i - \beta_l 1(X_i \le d) - \beta_u 1(X_i > d))^2 - \sum_{i=1}^n (Y_i - \hat\beta_l 1(X_i \le \hat d_n) - \hat\beta_u 1(X_i > \hat d_n))^2.$$

The corresponding profiled RSS statistic for testing the null hypothesis that
$d^0 = d$ replaces $\beta_l$ and $\beta_u$ in $\mathrm{RSS}_0$ by their least squares estimates
under the null hypothesis, giving

$$\mathrm{RSS}_1(d) = \sum_{i=1}^n (Y_i - \hat\beta_l^d 1(X_i \le d) - \hat\beta_u^d 1(X_i > d))^2 - \sum_{i=1}^n (Y_i - \hat\beta_l 1(X_i \le \hat d_n) - \hat\beta_u 1(X_i > \hat d_n))^2,$$

where

$$(\hat\beta_l^d, \hat\beta_u^d) = \arg\min_{\beta_l, \beta_u} \sum_{i=1}^n (Y_i - \beta_l 1(X_i \le d) - \beta_u 1(X_i > d))^2.$$

Our next result provides the asymptotic distribution of these residual 
sums of squares. 

Theorem 2.2. If (A1)-(A5) hold, then

$$n^{-1/3} \mathrm{RSS}_0(\beta_l^0, \beta_u^0, d^0) \to_d 2|\beta_l^0 - \beta_u^0| \max_t Q(t),$$

where $Q$ is given in Theorem 2.1, and $n^{-1/3} \mathrm{RSS}_1(d^0)$ has the same limiting
distribution.

Using the Brownian scaling (2.3), the above limiting distribution can be
expressed more simply as

$$2|\beta_l^0 - \beta_u^0|\, a (a/b)^{1/3} \max_t Q_1(t).$$

This leads to the following approximate $100(1 - \alpha)\%$ confidence set for the
split point:

$$\{d : \mathrm{RSS}_1(d) \le 2 n^{1/3} |\hat\beta_l - \hat\beta_u|\, \hat a (\hat a/\hat b)^{1/3} q_\alpha\}, \tag{2.5}$$

where $q_\alpha$ is the upper $\alpha$-quantile of $\max_t Q_1(t)$. This confidence set becomes
unstable if $\hat b$ is small relative to $\hat a$, as with the Wald-type confidence
interval. This problem can be lessened by changing the second term in $\mathrm{RSS}_1$
to make use of the information in the null hypothesis, to obtain

$$\mathrm{RSS}_2(d) = \sum_{i=1}^n (Y_i - \hat\beta_l^d 1(X_i \le d) - \hat\beta_u^d 1(X_i > d))^2 - \sum_{i=1}^n (Y_i - \hat\beta_l^d 1(X_i \le \hat d_n^d) - \hat\beta_u^d 1(X_i > \hat d_n^d))^2,$$

where

$$\hat d_n^d = \arg\min_{d'} \sum_{i=1}^n (Y_i - \hat\beta_l^d 1(X_i \le d') - \hat\beta_u^d 1(X_i > d'))^2. \tag{2.6}$$

The following result gives the asymptotic distribution of $\mathrm{RSS}_2(d^0)$.

Theorem 2.3. If (A1)-(A5) hold, then

$$n^{-1/3} \mathrm{RSS}_2(d^0) \to_d 2|\beta_l^0 - \beta_u^0| \max_t Q_0(t),$$

where $Q_0(t) = aW(t) - b_0 t^2$, and $a$, $b_0$ are given in Theorem 2.1.

This leads to the following approximate $100(1 - \alpha)\%$ confidence set for
the split point:

$$\{d : \mathrm{RSS}_2(d) \le 2 n^{1/3} |\hat\beta_l - \hat\beta_u|\, \hat a (\hat a/\hat b_0)^{1/3} q_\alpha\}, \tag{2.7}$$

where $\hat b_0$ is a consistent estimator of $b_0$. This confidence set could be
unstable if $b_0$ is small compared with $a$, but this is less likely to occur
than the instability we described earlier because $b_0 > b$. The proof of
Theorem 2.3 also shows that $n^{1/3}(\hat d_n^{d^0} - d^0)$ converges in distribution to
$\arg\max_t Q_0(t)$, recovering the limit distribution in Theorem 3.1 of [3], and
this provides another pivot-type confidence set for the split point,

$$\{d : |\hat d_n^d - d| \le n^{-1/3} (\hat a/\hat b_0)^{2/3} p_{\alpha/2}\}. \tag{2.8}$$

Typically, (2.5), (2.7) and (2.8) are not intervals, but their endpoints, or the
endpoints of their largest component, can be used as approximate confidence
limits.
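
The two profiled statistics can be evaluated at every candidate split with prefix sums, as in the sketch below. The function names are hypothetical, the code assumes there are no ties among the predictor values, and the calibration threshold (the right-hand side of (2.5) or (2.7)) is left to the caller because it depends on nuisance estimates and on the quantile $q_\alpha$, which is not tabulated here.

```python
from typing import List, Optional, Sequence, Tuple


def rss_profiles(x: Sequence[float],
                 y: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Return (candidates, RSS_1, RSS_2) at each candidate split point.

    Candidates are the sorted x values except the largest; the k-th candidate
    puts the first k sorted observations on the left. Assumes no ties in x.
    """
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    s1, s2 = [0.0], [0.0]              # prefix sums of y and y^2
    for v in ys:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def rss_fixed(k: int, bl: float, bu: float) -> float:
        # RSS when the first k points get level bl and the rest get bu.
        left = s2[k] - 2 * bl * s1[k] + k * bl * bl
        right = (s2[n] - s2[k]) - 2 * bu * (s1[n] - s1[k]) + (n - k) * bu * bu
        return left + right

    def means(k: int) -> Tuple[float, float]:
        return s1[k] / k, (s1[n] - s1[k]) / (n - k)

    ks = range(1, n)
    null_rss = [rss_fixed(k, *means(k)) for k in ks]
    global_min = min(null_rss)         # RSS of the unrestricted stump fit
    rss1 = [r - global_min for r in null_rss]
    rss2 = []
    for k, r in zip(ks, null_rss):
        bl, bu = means(k)              # levels estimated under the null d0 = d
        rss2.append(r - min(rss_fixed(j, bl, bu) for j in ks))
    return xs[:-1], rss1, rss2


def confidence_limits(cands: Sequence[float], rss: Sequence[float],
                      threshold: float) -> Optional[Tuple[float, float]]:
    """Endpoints of the whole set {d : RSS(d) <= threshold}, or None if empty."""
    inside = [d for d, r in zip(cands, rss) if r <= threshold]
    return (min(inside), max(inside)) if inside else None
```

By construction $0 \le \mathrm{RSS}_2(d) \le \mathrm{RSS}_1(d)$, since the second term of $\mathrm{RSS}_2$ minimizes over the split only, with the levels held at their null estimates. The inner minimization makes $\mathrm{RSS}_2$ quadratic in $n$, which is acceptable for moderate sample sizes.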

Remark 1. The uniqueness condition (A1) may be violated if the regression
function is not monotonic on the support of $X$. A simple example in
which uniqueness fails is given by $f(x) = x^2$ and $X \sim \mathrm{Unif}[-1, 1]$, in which
case the normal equations for the split point have two solutions,
$d^0 = \pm 1/\sqrt{2}$, and the corresponding $\beta_l^0$ and $\beta_u^0$ are different for each
solution; neither split point has a natural interpretation because the
regression function has no trend. More generally, we would expect lack of
unique split points for regression functions that are unimodal on the interior
of the support of $X$. In a practical situation, split-point analysis (with
stumps) should not be used unless there is reason to believe that a trend is
present, in which case we expect there to be a unique split point. An
increasing trend, for instance, gives that $E(Y|X \le d) < E(Y|X > d)$ for all $d$,
so a unique split point will exist provided the normal equation $g(d) = 0$ has a
unique solution, where $g$ is the "centered" regression function
$g(d) = f(d) - (E(Y|X \le d) + E(Y|X > d))/2$. A sufficient condition for
$g(d) = 0$ to have a unique solution is that $g$ is continuous and strictly
increasing, with $g(x_0) < 0$ and $g(x_1) > 0$ for some $x_0 < x_1$ in the support
of $X$.
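
The non-uniqueness example in Remark 1 can be checked numerically. The sketch below evaluates the centered regression function $g$ for a uniform predictor by midpoint-rule integration and locates sign changes by bisection; the function names, the grid and the tolerances are illustrative choices, and roots that fall exactly on a grid point are not detected.

```python
from typing import Callable, List


def centered_regression(f: Callable[[float], float], d: float,
                        lo: float = -1.0, hi: float = 1.0,
                        steps: int = 2000) -> float:
    """g(d) = f(d) - (E[f(X)|X<=d] + E[f(X)|X>d]) / 2 for X ~ Unif[lo, hi]."""
    def mean(a: float, b: float) -> float:
        # Midpoint-rule average of f over [a, b].
        h = (b - a) / steps
        return sum(f(a + (i + 0.5) * h) for i in range(steps)) / steps
    return f(d) - 0.5 * (mean(lo, d) + mean(d, hi))


def normal_equation_roots(f: Callable[[float], float], grid: List[float],
                          tol: float = 1e-8) -> List[float]:
    """Roots of g located by bisection between grid points where g changes sign."""
    roots = []
    for left, right in zip(grid, grid[1:]):
        g_left = centered_regression(f, left)
        g_right = centered_regression(f, right)
        if g_left * g_right >= 0:
            continue
        while right - left > tol:
            mid = 0.5 * (left + right)
            g_mid = centered_regression(f, mid)
            if g_left * g_mid <= 0:
                right = mid
            else:
                left, g_left = mid, g_mid
        roots.append(0.5 * (left + right))
    return roots


grid = [-0.9 + 0.1 * i for i in range(19)]
roots = normal_equation_roots(lambda x: x * x, grid)
```

For $f(x) = x^2$ the conditional means are $(d^2 - d + 1)/3$ and $(d^2 + d + 1)/3$, so $g(d) = (2d^2 - 1)/3$ and the two roots found numerically should agree with $\pm 1/\sqrt{2}$.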

Generalized nonparametric regression. Our results apply to split point
estimation for a generalized nonparametric regression model in which the
conditional distribution of $Y$ given $X$ is assumed to belong to an exponential
family. The canonical parameter of the exponential family is expressed
as $\theta(X)$ for an unknown smooth function $\theta(\cdot)$, and we are interested in
estimation of the split point in a decision tree approximation of $\theta(\cdot)$.
Nonparametric estimation of $\theta(\cdot)$ has been studied extensively; see, for
example, [6], Section 5.4. Important examples include the binary choice or
nonlinear logistic regression model $Y|X \sim \mathrm{Ber}(f(X))$, where
$f(x) = e^{\theta(x)}/(1 + e^{\theta(x)})$, and the Poisson regression model
$Y|X \sim \mathrm{Poi}(f(X))$, where $f(x) = e^{\theta(x)}$.

The conditional density of $Y$ given $X = x$ is specified as

$$p(y|x) = \exp\{\theta(x) y - B(\theta(x))\} h(y),$$

where $B(\cdot)$ and $h(\cdot)$ are known functions. Here $p(\cdot|x)$ is a probability
density function with respect to some given Borel measure $\mu$. The cumulant
function $B$ is twice continuously differentiable and $B'$ is strictly increasing
on the range of $\theta(\cdot)$. It can be shown that $f(x) = E(Y|X = x) = B'(\theta(x))$,
or equivalently $\theta(x) = \psi(f(x))$, where $\psi = (B')^{-1}$ is the link function. For
logistic regression $\psi(t) = \log(t/(1 - t))$ is the logit function, and for Poisson
regression $\psi(t) = \log(t)$. The link function is known, continuous and strictly
increasing, so a stump approximation to $\theta(x)$ is equivalent to a stump
approximation to $f(x)$, and the split points are identical. Exploiting this
equivalence, we define the best projected values of the stump approximation for
$\theta(\cdot)$ as $(\psi(\beta_l^0), \psi(\beta_u^0), d^0)$, where $(\beta_l^0, \beta_u^0, d^0)$ are given in (2.1).

Our earlier results apply under a reduced set of conditions due to the
additional structure in the exponential family model: we only need (A1),
(A2) with $\theta(\cdot)$ in place of $f$, and (A3). It is then easy to check that the
original assumption (A2) holds; in particular, $f'(d^0) = B''(\theta(d^0))\theta'(d^0) \ne 0$.
To check (A4), note that $\sigma^2(x) = \mathrm{Var}(Y|X = x) = B''(\theta(x))$ is continuous in
$x$. Finally, to check (A5), let $N$ be a bounded neighborhood of $d^0$. Note that
$f(\cdot)$ and $\theta(\cdot)$ are bounded on $N$. Let $\theta_0 = \inf_{x \in N} \theta(x)$ and
$\theta_1 = \sup_{x \in N} \theta(x)$. For $\eta$ sufficiently large,
$\{y : |y - f(x)| > \eta\} \subset \{y : |y| > \eta/2\}$ for all $x \in N$, and consequently

$$\sup_{x \in N} E[\varepsilon^2 1\{|\varepsilon| > \eta\} \mid X = x] = \sup_{x \in N} \int_{|y - f(x)| > \eta} (y - f(x))^2 p(y|x)\, d\mu(y) \le C \int_{|y| > \eta/2} (y^2 + 1)(e^{\theta_0 y} + e^{\theta_1 y}) h(y)\, d\mu(y) \to 0$$

as $\eta \to \infty$, where $C$ is a constant (not depending on $\eta$). The last step follows
from the dominated convergence theorem.

We have focused on confidence sets for the split point, but $\beta_l^0$ and $\beta_u^0$ may
also be important. For example, in logistic regression where the response $Y$
is an indicator variable, the relative risk

$$r = P(Y = 1 \mid X > d^0)/P(Y = 1 \mid X \le d^0) = \beta_u^0/\beta_l^0$$

is useful for comparing the risks before and after the split point. Using
Theorem 2.1 and the delta method, we can obtain the approximate
$100(1 - \alpha)\%$ confidence limits

$$\exp\left(\log(\hat\beta_u/\hat\beta_l) \pm \left(\frac{\hat c_2}{\hat\beta_u} - \frac{\hat c_1}{\hat\beta_l}\right) \hat\delta_n\right)$$

for $r$, where $\hat\delta_n$ is defined in (2.4) and it is assumed that
$c_1/\beta_l^0 \ne c_2/\beta_u^0$ to ensure that $\hat\beta_u/\hat\beta_l$ has a nondegenerate limit
distribution. The odds ratio for comparing $P(Y = 1 \mid X \le d^0)$ and
$P(Y = 1 \mid X > d^0)$ can be treated in a similar fashion.
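
The delta-method limits for the relative risk can be sketched as follows. The linearization used is $\log(\hat\beta_u/\hat\beta_l) - \log(\beta_u^0/\beta_l^0) \approx (\hat\beta_u - \beta_u^0)/\beta_u^0 - (\hat\beta_l - \beta_l^0)/\beta_l^0$, combined with the joint limit in Theorem 2.1; the function name and the numerical inputs are hypothetical, and `delta_n` is assumed to be the half-width for the split point from the Wald-type construction.

```python
import math
from typing import Tuple


def relative_risk_limits(beta_l: float, beta_u: float,
                         c1: float, c2: float,
                         delta_n: float) -> Tuple[float, float]:
    """Approximate confidence limits for r = beta_u / beta_l on the log scale.

    Half-width on the log scale: |c2 / beta_u - c1 / beta_l| * delta_n.
    """
    if beta_l <= 0 or beta_u <= 0:
        raise ValueError("both levels must be positive probabilities")
    center = math.log(beta_u / beta_l)
    half = abs(c2 / beta_u - c1 / beta_l) * delta_n
    return math.exp(center - half), math.exp(center + half)


# Illustrative inputs only (not estimates from data).
rr_low, rr_high = relative_risk_limits(beta_l=0.2, beta_u=0.6,
                                       c1=0.5, c2=0.8, delta_n=0.05)
```

Working on the log scale keeps both limits positive and makes the interval symmetric about $\hat\beta_u/\hat\beta_l$ in the multiplicative sense.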

3. Extending the decision tree approach. We have noted that split-point 
estimation with stumps should only be used if a trend is present. The split- 
point approach can be adapted to more complex situations, however, by 
using a more flexible working model that provides a better approximation
to the underlying regression curve. In this section, we indicate how our main 
results extend to a broad class of parametric working models. The proofs 
are omitted as they run along similar lines. 

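The constants $\beta_l$ and $\beta_u$ are now replaced by functions $\Psi_l(\beta_l, x)$ and
$\Psi_u(\beta_u, x)$ specified in terms of vector parameters $\beta_l$ and $\beta_u$. These
functions are taken to be twice continuously differentiable with respect to
$\beta_l \in \mathbb{R}^m$ and $\beta_u \in \mathbb{R}^k$, respectively, and continuously differentiable
with respect to $x$. The best projected values of the parameters in the working
model are defined by

$$(\beta_l^0, \beta_u^0, d^0) = \arg\min_{\beta_l, \beta_u, d} E[Y - \Psi_l(\beta_l, X) 1(X \le d) - \Psi_u(\beta_u, X) 1(X > d)]^2, \tag{3.1}$$

and the corresponding normal equations are

$$E\left[\frac{\partial}{\partial\beta_l} \Psi_l(\beta_l^0, X)(Y - \Psi_l(\beta_l^0, X)) 1(X \le d^0)\right] = 0, \qquad E\left[\frac{\partial}{\partial\beta_u} \Psi_u(\beta_u^0, X)(Y - \Psi_u(\beta_u^0, X)) 1(X > d^0)\right] = 0,$$

and $f(d^0) = \bar\Psi(d^0)$, where $\bar\Psi(x) = (\Psi_l(\beta_l^0, x) + \Psi_u(\beta_u^0, x))/2$. The least
squares estimates of these quantities are obtained as

$$(\hat\beta_l, \hat\beta_u, \hat d_n) = \arg\min_{\beta_l, \beta_u, d} \sum_{i=1}^n [Y_i - \Psi_l(\beta_l, X_i) 1(X_i \le d) - \Psi_u(\beta_u, X_i) 1(X_i > d)]^2. \tag{3.2}$$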

To extend Theorem 2.1, we need to modify conditions (A1) and (A2) as
follows:

(A1)' There is a unique minimizer $(\beta_l^0, \beta_u^0, d^0)$ of the expectation on the
right-hand side of (3.1) with $\Psi_l(\beta_l^0, d^0) \ne \Psi_u(\beta_u^0, d^0)$.

(A2)' $f(x)$ is continuously differentiable in an open neighborhood $N$ of
$d^0$. Also, $f'(d^0) \ne \bar\Psi'(d^0)$.

In addition, we need the following Lipschitz condition on the working model:

(A6) There exist functions $\dot\Psi_l(x)$ and $\dot\Psi_u(x)$, bounded on compacts, such
that

$$|\Psi_l(\beta_l, x) - \Psi_l(\tilde\beta_l, x)| \le \dot\Psi_l(x) |\beta_l - \tilde\beta_l|$$

and

$$|\Psi_u(\beta_u, x) - \Psi_u(\tilde\beta_u, x)| \le \dot\Psi_u(x) |\beta_u - \tilde\beta_u|,$$

with $\dot\Psi_l(X)$, $\Psi_l(\beta_l^0, X)$, $\dot\Psi_u(X)$, $\Psi_u(\beta_u^0, X)$ having finite fourth
moments, where $|\cdot|$ is Euclidean distance.

Condition (A6) holds, for example, if $\Psi_l(\beta_l, x)$ and $\Psi_u(\beta_u, x)$ are polynomials
in $x$ with the components of $\beta_l$ and $\beta_u$ serving as coefficients, and $X$ has a
finite moment of sufficiently high order.

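Theorem 3.1. If (A1)', (A2)' and (A3)-(A6) hold, then

$$n^{1/3}(\hat\beta_l - \beta_l^0,\ \hat\beta_u - \beta_u^0,\ \hat d_n - d^0) \to_d \arg\min_h \mathbb{W}(h),$$

where $\mathbb{W}$ is the Gaussian process

$$\mathbb{W}(h) = \tilde a W(h_{m+k+1}) + h^T V h/2, \qquad h \in \mathbb{R}^{m+k+1},$$

$V$ is the (positive definite) Hessian matrix of the function

$$(\beta_l, \beta_u, d) \mapsto E[Y - \Psi_l(\beta_l, X) 1(X \le d) - \Psi_u(\beta_u, X) 1(X > d)]^2$$

evaluated at $(\beta_l^0, \beta_u^0, d^0)$, and
$\tilde a = 2|\Psi_l(\beta_l^0, d^0) - \Psi_u(\beta_u^0, d^0)| (\sigma^2(d^0) p_X(d^0))^{1/2}$.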

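Remark 2. As in the decision tree case, subsampling can now be used
to construct confidence intervals for the parameters of the working model.
Although Brownian scaling is still available [minimizing $\mathbb{W}(h)$ by first
holding $h_{m+k+1}$ fixed], the construction of Wald-type confidence intervals would
be cumbersome, needing estimation of all the nuisance parameters involved
in $\tilde a$ and $V$. The complexity of $V$ is already evident when $\beta_l$ and $\beta_u$ are
one-dimensional, in which case direct computation shows that $V$ is the $3 \times 3$
matrix with entries $V_{12} = V_{21} = 0$,

$$V_{11} = 2 \int_{-\infty}^{d^0} \left(\frac{\partial}{\partial\beta_l} \Psi_l(\beta_l^0, x)\right)^2 p_X(x)\, dx + 2 \int_{-\infty}^{d^0} \frac{\partial^2}{\partial\beta_l^2} \Psi_l(\beta_l^0, x)(\Psi_l(\beta_l^0, x) - f(x)) p_X(x)\, dx,$$

$$V_{22} = 2 \int_{d^0}^{\infty} \left(\frac{\partial}{\partial\beta_u} \Psi_u(\beta_u^0, x)\right)^2 p_X(x)\, dx + 2 \int_{d^0}^{\infty} \frac{\partial^2}{\partial\beta_u^2} \Psi_u(\beta_u^0, x)(\Psi_u(\beta_u^0, x) - f(x)) p_X(x)\, dx,$$

$$V_{33} = 2|(\Psi_l(\beta_l^0, d^0) - \Psi_u(\beta_u^0, d^0))(f'(d^0) - \bar\Psi'(d^0))|\, p_X(d^0),$$

$$V_{13} = V_{31} = (\Psi_l(\beta_l^0, d^0) - \Psi_u(\beta_u^0, d^0)) \frac{\partial}{\partial\beta_l} \Psi_l(\beta_l^0, d^0)\, p_X(d^0),$$

$$V_{23} = V_{32} = (\Psi_l(\beta_l^0, d^0) - \Psi_u(\beta_u^0, d^0)) \frac{\partial}{\partial\beta_u} \Psi_u(\beta_u^0, d^0)\, p_X(d^0).$$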

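Next we show that extending Theorem 2.3 allows us to circumvent this
problem. Two more conditions are needed:

(A7) $\int_D (\Psi_l(\beta_l^0, x) - \Psi_u(\beta_u^0, x))(f(x) - \bar\Psi(x)) p_X(x)\, dx \ne 0$, for
$D = (-\infty, d^0]$ and $D = [d^0, \infty)$.

(A8) $\sqrt{n}(\hat\beta_l^{d^0} - \beta_l^0) = O_p(1)$ and $\sqrt{n}(\hat\beta_u^{d^0} - \beta_u^0) = O_p(1)$, where
$\hat\beta_l^{d^0}$ and $\hat\beta_u^{d^0}$ are defined in an analogous fashion to Section 2.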

Note that (A8) holds automatically in the setting of Section 2, using the
central limit theorem and the delta method. In the present setting, sufficient
conditions for (A8) can be easily formulated in terms of $\Psi_l$, $\Psi_u$ and
the joint distribution of $(X, Y)$, using the theory of Z-estimators. If we
define $\phi_{\beta_l}(x, y) = (y - \Psi_l(\beta_l, x))(\partial\Psi_l(\beta_l, x)/\partial\beta_l) 1(x \le d^0)$, then
$\beta_l^0$ satisfies the normal equation $P\phi_{\beta_l} = 0$, while $\hat\beta_l^{d^0}$ satisfies
$\mathbb{P}_n \phi_{\beta_l} = 0$, where $\mathbb{P}_n$ is the empirical distribution of $(X_i, Y_i)$.
Sufficient conditions for the asymptotic normality of
$\sqrt{n}(\hat\beta_l^{d^0} - \beta_l^0)$ are then given by Lemma 3.3.5 of [21] (see
also Examples 3.3.7 and 3.3.8 in Section 3.3 in [21], which are special cases
of Lemma 3.3.5 in the context of finite-dimensional parametric models) in
conjunction with $\beta \mapsto P\phi_\beta$ possessing a nonsingular derivative at $\beta_l^0$. In
particular, if $\Psi_l$ and $\Psi_u$ are polynomials in $x$ with the $\beta_l$ and $\beta_u$ serving
as coefficients, then the displayed condition in Example 3.3.7 is easily
verifiable under the assumption that $X$ has a finite moment of a sufficiently
high order (which is trivially true if $X$ has compact support).

Defining

$$\mathrm{RSS}_2(d) = \sum_{i=1}^n (Y_i - \Psi_l(\hat\beta_l^d, X_i) 1(X_i \le d) - \Psi_u(\hat\beta_u^d, X_i) 1(X_i > d))^2 - \sum_{i=1}^n (Y_i - \Psi_l(\hat\beta_l^d, X_i) 1(X_i \le \hat d_n^d) - \Psi_u(\hat\beta_u^d, X_i) 1(X_i > \hat d_n^d))^2,$$

where

$$\hat d_n^d = \arg\min_{d'} \sum_{i=1}^n (Y_i - \Psi_l(\hat\beta_l^d, X_i) 1(X_i \le d') - \Psi_u(\hat\beta_u^d, X_i) 1(X_i > d'))^2,$$

we obtain the following extension of Theorem 2.3.

Theorem 3.2. If (A1)', (A2)', (A3)-(A5), (A7) and (A8) hold, and
the random variables $\dot\Psi_l(X)$, $\Psi_l(\beta_l^0, X)$, $\dot\Psi_u(X)$ and $\Psi_u(\beta_u^0, X)$ are
square integrable, then

$$n^{-1/3} \mathrm{RSS}_2(d^0) \to_d 2|\Psi_l(\beta_l^0, d^0) - \Psi_u(\beta_u^0, d^0)| \max_t Q_0(t),$$

where $Q_0(t) = aW(t) - b_0 t^2$, and $a^2 = \sigma^2(d^0) p_X(d^0)$,
$b_0 = |f'(d^0) - \bar\Psi'(d^0)| \times p_X(d^0)$.

Application of the above result to construct confidence sets [as in (2.7)] is
easier than using Theorem 3.1, since estimation of $a$ and $b_0$ requires much less
work than estimation of the matrix $V$; the latter is essentially intractable,
even for moderate $k$ and $m$.

4. Numerical examples. In this section we compare the various confi- 
dence sets for the split point in a binary decision tree using simulated data. 
We also develop the Everglades application mentioned in the Introduction. 

4.1. Simulation study. We consider a regression model of the form
$Y = f(X) + \varepsilon$, where $X \sim \mathrm{Unif}[0, 1]$ and $\varepsilon | X \sim N(0, \sigma^2(X))$. The
regression function $f$ is specified as the sigmoid (or logistic distribution)
function

$$f(x) = e^{15(x - 0.5)}/(1 + e^{15(x - 0.5)}).$$

This increasing S-shaped function rises steeply between 0.2 and 0.8, but is
relatively flat otherwise. It is easily checked that $d^0 = 0.5$, $\beta_l^0 = 0.092$ and
$\beta_u^0 = 0.908$. We take $\sigma^2(x) = 0.25$ to produce an example with homoscedastic
error, and $\sigma^2(x) = \exp(-2.77x)$ for an example with heteroscedastic error;
these two error variances agree at the split point.
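
The projected values quoted for this design can be verified in closed form. The sketch below assumes the steepness constant 15 in the sigmoid and takes $b$ to be $b_0 - \frac{1}{8}|\beta_l^0 - \beta_u^0| p_X(d^0)^2 (1/F_X(d^0) + 1/(1 - F_X(d^0)))$ with $b_0 = |f'(d^0)| p_X(d^0)/2$; variable names are illustrative.

```python
import math

K = 15.0  # assumed steepness of the sigmoid f(x) = 1 / (1 + exp(-K (x - 0.5)))


def f(x: float) -> float:
    """Regression function of the simulation design."""
    return 1.0 / (1.0 + math.exp(-K * (x - 0.5)))


# E[f(X) | X <= 0.5] for X ~ Unif[0, 1]: the antiderivative of the sigmoid is
# log(1 + exp(u)) / K with u = K (x - 0.5).
beta_l0 = (2.0 / K) * (math.log(2.0) - math.log1p(math.exp(-K / 2.0)))
beta_u0 = 1.0 - beta_l0            # by the symmetry f(1 - x) = 1 - f(x)
d0 = 0.5

p_x, F_x = 1.0, 0.5                # Unif[0, 1] density and cdf at d0
f_prime = K / 4.0                  # f'(0.5) = K f(0.5) (1 - f(0.5))
sigma2 = 0.25                      # both variance functions equal ~0.25 at d0

a = math.sqrt(sigma2 * p_x)
b0 = abs(f_prime) * p_x / 2.0
b = b0 - abs(beta_l0 - beta_u0) * p_x ** 2 * (1.0 / F_x + 1.0 / (1.0 - F_x)) / 8.0

# Normal equation for the split point: f(d0) equals the average of the levels.
normal_eq_gap = f(d0) - 0.5 * (beta_l0 + beta_u0)
```

Under these assumptions $a = 0.5$, $b_0 = 1.875$ and $b \approx 1.467$, so $b$ is positive and noticeably smaller than $b_0$ in this design.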

Table 1
Coverage and average confidence interval length, $\sigma^2(x) = 0.25$

          Subsampling           Wald              RSS_1             RSS_2
   n   Coverage  Length   Coverage  Length   Coverage  Length   Coverage  Length
  75    0.957    0.326     0.883    0.231     0.942    0.273     0.957    0.345
 100    0.970    0.283     0.894    0.210     0.954    0.235     0.956    0.280
 200    0.978    0.200     0.926    0.167     0.952    0.174     0.959    0.198
 500    0.991    0.136     0.947    0.123     0.947    0.118     0.948    0.128
1000    0.929    0.093     0.944    0.097     0.955    0.091     0.952    0.098
1500    0.936    0.098     0.947    0.085     0.933    0.078     0.921    0.083
2000    0.944    0.090     0.954    0.077     0.935    0.070     0.939    0.074

To compute the subsampling confidence interval, a data-driven choice of 
block size was not feasible computationally. Instead, the block size was de- 
termined via a pilot simulation. For a given sample size, 1000 independently 
replicated samples were generated from the (true) regression model, and for 
each data set a collection of subsampling-based intervals (of nominal level 
95%) was constructed, for block sizes of the form $m_n = n^\gamma$, for $\gamma$ on a grid
of values between 0.33 and 0.9. The block size giving the greatest empirical 
accuracy (in terms of being closest to 95% coverage based on the replicated 
samples) was used in the subsequent simulation study. To provide a fair 
comparison, we used the true values of the nuisance parameters to calibrate 
the Wald- and RSS-type confidence sets. For $\mathrm{RSS}_1$ and $\mathrm{RSS}_2$ we use the
endpoints of the longest connected component to specify confidence limits. 

Tables 1 and 2 report the results of simulations based on 1000 replicated 
samples, with sample sizes ranging from 75 to 2000, and each confidence 
interval (CI) calibrated to have nominal 95% coverage. The subsampling CI 
tends to be wider than the others, especially at small sample sizes. The Wald- 
type CI suffers from severe undercoverage, especially in the heteroscedastic 
case and at small sample sizes. The $\mathrm{RSS}_1$-type CI is also prone to undercoverage
in the heteroscedastic case. The $\mathrm{RSS}_2$-type CI performs well, although
there is a slight undercoverage at high sample sizes (the interval formed by 
the endpoints of the entire confidence set has greater accuracy in that case). 

4.2. Application to Everglades data. The "river of grass" known as the 
Everglades is a majestic wetland covering much of South Florida. Severe 
damage to large swaths of this unique ecosystem has been caused by pol- 
lution from agricultural fertilizers and the disruption of water flow (e.g., 
from the construction of canals). Efforts to restore the Everglades started 
in earnest in the early 1990s. In 1994, the Florida legislature passed the Ev- 
erglades Forever Act, which called for a threshold level of total phosphorus 
that would prevent an "imbalance in natural populations of aquatic flora or
fauna." This threshold may eventually be set at around 10 or 15 parts per
billion (ppb), but it remains undecided despite extensive scientific study and 
much political and legal debate; see [17] for a discussion of the statistical 
issues involved. 

Between 1992 and 1998, the Duke University Wetlands Center (DUWC) 
carried out a dosing experiment at two unimpacted sites in the Everglades. 
This experiment was designed to find the threshold level of total phosphorus 
concentration at which biological imbalance occurs. Changes in the abun- 
dance of various phosphorus-sensitive species were monitored along dosing 
channels in which a gradient of phosphorus concentration had been estab- 
lished. Qian, King and Richardson [16] analyzed data from this experiment 
using Bayesian change-point analysis, and also split point estimation with 
the split point being interpreted as the threshold level at which biological 
imbalance occurs. Uncertainty in the split point was evaluated using Efron's 
bootstrap. 

We illustrate our approach with one particular species monitored in the 
DUWC dosing experiment: the bladderwort Utricularia purpurea, which is
considered a keystone species for the health of the Everglades ecosystem. 
Figure 1 shows 340 observations of stem density plotted against the six- 
month geometric mean of total phosphorus concentration. The displayed 
data were collected in August 1995, March 1996, April 1998 and August 
1998 (observations taken at unusually low or high water levels, or before 
the system stabilized in 1995, are excluded). Water levels fluctuate greatly 
and have a strong influence on species abundance, so a separate analysis 
for each data collection period would be preferable, but not enough data 
are available for separate analyses and a more sophisticated model would be 
needed, so for simplicity we have pooled all the data. 

Estimates of $p_X$, $f'$ and $\sigma^2$ needed for $\hat a$, $\hat b$ and $\hat b_0$, and the estimate of
$f$ shown in Figure 1, are found using David Ruppert's (Matlab) implementation
of local polynomial regression and density estimation with empirical-bias
bandwidth selection [18]. The estimated regression function shows a
fairly steady decrease in stem density with increasing phosphorus concentration,
but there is no abrupt change around the split point estimate of 12.8
ppb, so we expect the CIs to be relatively wide. The 95% Wald-type and
RSS1-type CIs for the split point are 0.7-24.9 and 9.7-37.1 ppb, respectively.
The instability problem mentioned earlier may be causing these CIs to be so
wide (here $\hat a/\hat b = 722$). The subsampling and RSS2-type CIs are narrower, at
8.5-17.1 and 7.1-26.1 ppb, respectively (see the vertical lines in Figure 1),
but they still leave considerable uncertainty about the true location of the
split point. The 10-ppb threshold recommended by the Florida Department
of Environmental Protection [13] falls into these CIs.

Table 2
Coverage and average confidence interval length, $\sigma^2(x) = \exp(-2.77x)$

          Subsampling         Wald                RSS1                RSS2
   n      Coverage  Length    Coverage  Length    Coverage  Length    Coverage  Length
   75     0.951     0.488     0.863     0.231     0.929     0.270     0.949     0.354
   100    0.957     0.315     0.884     0.210     0.923     0.231     0.944     0.283
   200    0.977     0.257     0.915     0.167     0.939     0.173     0.949     0.196
   500    0.931     0.124     0.926     0.123     0.936     0.117     0.948     0.128
   1000   0.917     0.095     0.941     0.097     0.948     0.090     0.945     0.097
   1500   0.938     0.083     0.938     0.085     0.928     0.078     0.922     0.083
   2000   0.945     0.076     0.930     0.077     0.933     0.070     0.934     0.074

Fig. 1. Data from the DUWC Everglades phosphorus dosing study showing variations in
bladderwort (Utricularia P.) stem density (number of stems per square meter) in response
to total phosphorus concentration (six-month geometric mean, units of ppb). The vertical
solid lines show the limits of the RSS2-type 95% confidence interval for the split point.
The vertical dashed lines show the limits of the subsampling confidence interval. The local
polynomial regression fit is also plotted.
The interpretation of the split point as a biological threshold is the source
of some controversy in the debate over a numeric phosphorus criterion [13]. 
It can be argued that the split point is only crudely related to biological re- 
sponse and that it is a statistical construct depending on an artificial working 
model. Yet the split point approach fulfills a clear need in the absence of 
better biological understanding, and is preferable to a change-point analysis 
in this application, as discussed in the Introduction. 

5. Proofs. The proofs have certain points in common with Bühlmann
and Yu [3] and Kim and Pollard [11], but to make them more self-contained 
we mainly appeal to general results on empirical processes and M-estimation 
that are collected in [21]. 

We begin by proving Theorem 2.3, which is closely related to Theorem 
3.1 of [3]. 

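Proof of Theorem 2.3. We derive the joint limiting distribution of

$$\bigl(n^{1/3}(\hat d_n^{\,d^0} - d^0),\; n^{-1/3}\,\mathrm{RSS}_2(d^0)\bigr),$$

the marginals of which are involved in calibrating the confidence sets (2.7)
and (2.8). To simplify the notation, we denote $(\hat\beta_l^{d^0}, \hat\beta_u^{d^0}, \hat d_n^{\,d^0})$ by $(\hat\beta_l^0, \hat\beta_u^0, \hat d_n^0)$.
Also, we assume that $\beta_l^0 > \beta_u^0$; the derivation for the other case is analogous.
Letting $\mathbb{P}_n$ denote the empirical measure of the pairs $\{(X_i, Y_i), i = 1, \ldots, n\}$,
we can write

$$\begin{aligned}
\mathrm{RSS}_2(d^0) &= \sum_{i=1}^n (Y_i - \hat\beta_l^0)^2 \bigl(1(X_i \le d^0) - 1(X_i \le \hat d_n^0)\bigr)
 + \sum_{i=1}^n (Y_i - \hat\beta_u^0)^2 \bigl(1(X_i > d^0) - 1(X_i > \hat d_n^0)\bigr) \\
&= n\,\mathbb{P}_n\bigl[\bigl((Y - \hat\beta_l^0)^2 - (Y - \hat\beta_u^0)^2\bigr)\bigl(1(X \le d^0) - 1(X \le \hat d_n^0)\bigr)\bigr] \\
&= 2(\hat\beta_l^0 - \hat\beta_u^0)\, n\,\mathbb{P}_n\Bigl[\Bigl(Y - \frac{\hat\beta_l^0 + \hat\beta_u^0}{2}\Bigr)\bigl(1(X \le \hat d_n^0) - 1(X \le d^0)\bigr)\Bigr].
\end{aligned}$$

Therefore,

$$n^{-1/3}\,\mathrm{RSS}_2(d^0) = 2(\hat\beta_l^0 - \hat\beta_u^0)\, n^{2/3}\,\mathbb{P}_n\bigl[(Y - \hat f(d^0))\bigl(1(X \le \hat d_n^0) - 1(X \le d^0)\bigr)\bigr],$$

where $\hat f(d^0) = (\hat\beta_l^0 + \hat\beta_u^0)/2$. Let

$$\xi_n(d) = n^{2/3}\,\mathbb{P}_n\bigl[(Y - \hat f(d^0))\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr]$$

and let $\tilde d_n$ be the maximizer of this process. Since $\hat\beta_l^0 - \hat\beta_u^0 \to \beta_l^0 - \beta_u^0$
almost surely, it is easy to see that $\hat d_n^0 = \tilde d_n$ for $n$ sufficiently large almost
surely. Hence, the limiting distribution of $n^{-1/3}\,\mathrm{RSS}_2(d^0)$ must be the same
as that of $2(\hat\beta_l^0 - \hat\beta_u^0)\,\xi_n(\tilde d_n)$, which in turn is the same as that of
$2(\beta_l^0 - \beta_u^0)\,\xi_n(\tilde d_n)$ [provided $\xi_n(\tilde d_n)$ has a limit distribution], because $\hat\beta_l^0$ and $\hat\beta_u^0$ are
$\sqrt{n}$-consistent. Furthermore, the limiting distribution of $n^{1/3}(\hat d_n^0 - d^0)$ is the
same as that of $n^{1/3}(\tilde d_n - d^0)$ (provided a limiting distribution exists).

Let $Q_n(t) = \xi_n(d^0 + t n^{-1/3})$ and $\hat t_n = \arg\max_t Q_n(t)$, so that $\hat t_n = n^{1/3}(\tilde d_n - d^0)$.
It now suffices to find the joint limiting distribution of $(\hat t_n, Q_n(\hat t_n))$.
Lemma 5.1 below shows that $Q_n(t)$ converges in distribution in the space
$B_{\mathrm{loc}}(\mathbb{R})$ (the space of locally bounded functions on $\mathbb{R}$ equipped with the
topology of uniform convergence on compacta) to the Gaussian process
$Q_0(t) = aW(t) - b_0 t^2$ whose distribution is a tight Borel measure concentrated
on $C_{\max}(\mathbb{R})$ [the separable subspace of $B_{\mathrm{loc}}(\mathbb{R})$ of all continuous functions
on $\mathbb{R}$ that diverge to $-\infty$ as the argument runs off to $\pm\infty$ and that
have a unique maximum]. Furthermore, Lemma 5.1 shows that the sequence
$\{\hat t_n\}$ of maximizers of $\{Q_n(t)\}$ is $O_p(1)$. By Theorem 5.1 below, we conclude
that $(\hat t_n, Q_n(\hat t_n)) \to_d (\arg\max_t Q_0(t), \max_t Q_0(t))$. This completes the proof.
□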
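
The limit pair (argmax, max) of the process aW(t) - b_0 t^2 has no simple closed form, but it is easy to approximate by simulation. The sketch below is an illustration under stated assumptions, not the calibration procedure of the paper: the function name `simulate_q0`, the truncation of the time axis to [-half_width, half_width] and the grid step are all hypothetical choices, and W is taken to be a two-sided Brownian motion with W(0) = 0.

```python
import numpy as np

def simulate_q0(a, b0, n_rep=2000, half_width=6.0, step=0.01, seed=0):
    """Monte Carlo draws of (argmax_t Q0(t), max_t Q0(t)) for
    Q0(t) = a*W(t) - b0*t**2 on a finite grid (an approximation)."""
    rng = np.random.default_rng(seed)
    m = int(round(half_width / step))
    t_pos = step * np.arange(1, m + 1)
    t = np.concatenate([-t_pos[::-1], [0.0], t_pos])
    argmaxes = np.empty(n_rep)
    maxima = np.empty(n_rep)
    for r in range(n_rep):
        # Two independent Brownian paths started at 0, glued at t = 0.
        right = np.cumsum(rng.normal(0.0, np.sqrt(step), m))
        left = np.cumsum(rng.normal(0.0, np.sqrt(step), m))
        w = np.concatenate([left[::-1], [0.0], right])
        q = a * w - b0 * t ** 2
        j = int(np.argmax(q))
        argmaxes[r], maxima[r] = t[j], q[j]
    return argmaxes, maxima

if __name__ == "__main__":
    arg, mx = simulate_q0(a=1.0, b0=1.0)
    # Empirical upper quantiles of the kind used to calibrate confidence sets.
    print(np.quantile(np.abs(arg), 0.95), np.quantile(mx, 0.95))
```

Because the quadratic drift dominates for large |t|, truncating the time axis has little effect once half_width is several multiples of (a/b_0)^{2/3}; the grid step controls the remaining discretization error.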

The following theorem provides sufficient conditions for the joint weak
convergence of a sequence of maximizers and the corresponding maxima
of a general sequence of processes in $B_{\mathrm{loc}}(\mathbb{R}^k)$. A referee suggested that an
alternative approach would be to use $D(\mathbb{R})$ (the space of right-continuous
functions with left-limits equipped with Lindvall's extension of the Skorohod
topology) instead of $B_{\mathrm{loc}}(\mathbb{R})$, as in an argmax-continuous mapping theorem
due to Ferger ([7], Theorem 3).

Theorem 5.1. Let $\{Q_n(t)\}$ be a sequence of stochastic processes converging
in distribution in the space $B_{\mathrm{loc}}(\mathbb{R}^k)$ to the process $Q(t)$, whose distribution
is a tight Borel measure concentrated on $C_{\max}(\mathbb{R}^k)$. If $\{\hat t_n\}$ is a
sequence of maximizers of $\{Q_n(t)\}$ such that $\hat t_n = O_p(1)$, then

$$(\hat t_n, Q_n(\hat t_n)) \to_d \Bigl(\arg\max_t Q(t),\; \max_t Q(t)\Bigr).$$

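Proof. For simplicity, we provide the proof for the case that $k = 1$; the
same argument essentially carries over to the $k$-dimensional case. By invoking
Dudley's representation theorem (Theorem 2.2 of [11]), for the processes
$Q_n$, we can construct a sequence of processes $\tilde Q_n$ and a process $\tilde Q$ defined
on a common probability space $(\tilde\Omega, \tilde{\mathcal A}, \tilde P)$ with (a) $\tilde Q_n$ being distributed as
$Q_n$, (b) $\tilde Q$ being distributed as $Q$ and (c) $\tilde Q_n$ converging to $\tilde Q$ almost surely
(with respect to $\tilde P$) under the topology of uniform convergence on compact
sets. Thus, (i) $\tilde t_n$, the maximizer of $\tilde Q_n$, has the same distribution as $\hat t_n$, (ii)
$\tilde t$, the maximizer of $\tilde Q(t)$, has the same distribution as $\arg\max Q(t)$ and (iii)
$\tilde Q_n(\tilde t_n)$ and $\tilde Q(\tilde t)$ have the same distribution as $Q_n(\hat t_n)$ and $\max Q(t)$, respectively.
So it suffices to show that $\tilde t_n$ converges in $\tilde P^*$ (outer) probability
to $\tilde t$ and $\tilde Q_n(\tilde t_n)$ converges in $\tilde P^*$ (outer) probability to $\tilde Q(\tilde t)$. The convergence
of $\tilde t_n$ to $\tilde t$ in outer probability is shown in Theorem 2.7 of [11].

To show that $\tilde Q_n(\tilde t_n)$ converges in probability to $\tilde Q(\tilde t)$, we need to show
that for fixed $\epsilon > 0$, $\delta > 0$, we eventually have

$$\tilde P^*\bigl(|\tilde Q_n(\tilde t_n) - \tilde Q(\tilde t)| > \delta\bigr) < \epsilon.$$

Since $\tilde t_n$ and $\tilde t$ are $O_p(1)$, there exists $M_\epsilon > 0$ such that, with

$$A_n^c \equiv \{\tilde t_n \notin [-M_\epsilon, M_\epsilon]\}, \qquad B_n^c \equiv \{\tilde t \notin [-M_\epsilon, M_\epsilon]\},$$

$\tilde P^*(A_n^c) < \epsilon/4$ and $\tilde P^*(B_n^c) < \epsilon/4$, eventually. Furthermore, as $\tilde Q_n$ converges
to $\tilde Q$ almost surely and therefore in probability, uniformly on every compact
set, with

$$C_n^c \equiv \Bigl\{ \sup_{s \in [-M_\epsilon, M_\epsilon]} |\tilde Q_n(s) - \tilde Q(s)| > \delta \Bigr\}$$

we have $\tilde P^*(C_n^c) < \epsilon/2$, eventually. Hence, $\tilde P^*(A_n^c \cup B_n^c \cup C_n^c) < \epsilon$, so that
$\tilde P_*(A_n \cap B_n \cap C_n) > 1 - \epsilon$, eventually. But

$$A_n \cap B_n \cap C_n \subset \{|\tilde Q_n(\tilde t_n) - \tilde Q(\tilde t)| \le \delta\}, \tag{5.1}$$

and consequently

$$\tilde P_*\bigl(|\tilde Q_n(\tilde t_n) - \tilde Q(\tilde t)| \le \delta\bigr) \ge \tilde P_*(A_n \cap B_n \cap C_n) > 1 - \epsilon,$$

eventually. This implies immediately that

$$\tilde P^*\bigl(|\tilde Q_n(\tilde t_n) - \tilde Q(\tilde t)| > \delta\bigr) < \epsilon$$

for all sufficiently large $n$. It remains to show (5.1). To see this, note that
for any $\omega \in A_n \cap B_n \cap C_n$ and $s \in [-M_\epsilon, M_\epsilon]$,

$$\tilde Q_n(s) = \tilde Q(s) + \tilde Q_n(s) - \tilde Q(s) \le \tilde Q(\tilde t) + |\tilde Q_n(s) - \tilde Q(s)|.$$

Taking the supremum over $s \in [-M_\epsilon, M_\epsilon]$ and noting that $\tilde t_n \in [-M_\epsilon, M_\epsilon]$
on the set $A_n \cap B_n \cap C_n$, we have

$$\tilde Q_n(\tilde t_n) \le \tilde Q(\tilde t) + \sup_{s \in [-M_\epsilon, M_\epsilon]} |\tilde Q_n(s) - \tilde Q(s)|,$$

or equivalently

$$\tilde Q_n(\tilde t_n) - \tilde Q(\tilde t) \le \sup_{s \in [-M_\epsilon, M_\epsilon]} |\tilde Q_n(s) - \tilde Q(s)|.$$

An analogous derivation (replacing $\tilde Q_n$ everywhere by $\tilde Q$ and $\tilde t_n$ by $\tilde t$, and
vice versa) yields

$$\tilde Q(\tilde t) - \tilde Q_n(\tilde t_n) \le \sup_{s \in [-M_\epsilon, M_\epsilon]} |\tilde Q(s) - \tilde Q_n(s)|.$$

Thus

$$|\tilde Q_n(\tilde t_n) - \tilde Q(\tilde t)| \le \sup_{s \in [-M_\epsilon, M_\epsilon]} |\tilde Q_n(s) - \tilde Q(s)| \le \delta,$$

which completes the proof. □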
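
The key deterministic step in this argument is that, on a common compact set, two maxima can differ by no more than the uniform distance between the two functions. A small numerical sketch of that inequality follows; the helper name `max_gap_and_sup_gap` and the example functions are hypothetical, and a finite grid stands in for the compact interval.

```python
import numpy as np

def max_gap_and_sup_gap(q, q_tilde):
    """Return (|max q - max q_tilde|, sup |q - q_tilde|) for two functions
    evaluated on the same grid."""
    q = np.asarray(q, float)
    q_tilde = np.asarray(q_tilde, float)
    max_gap = abs(q.max() - q_tilde.max())
    sup_gap = np.max(np.abs(q - q_tilde))
    return max_gap, sup_gap

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = np.linspace(-3.0, 3.0, 601)
    q_limit = -s ** 2                                # stand-in for the limit
    q_n = q_limit + 0.1 * rng.normal(size=s.size)    # a perturbed version
    gap, sup = max_gap_and_sup_gap(q_n, q_limit)
    print(gap <= sup)
```

The bound holds for every pair of functions on the grid, which is why uniform convergence on compacta, together with tightness of the maximizers, is enough to carry the maxima along.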

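The following modification of a rate theorem of van der Vaart and Wellner
([21], Theorem 3.2.5) is needed in the proof of Lemma 5.1. The notation $\lesssim$
means that the left-hand side is bounded by a generic constant times the
right-hand side.

Theorem 5.2. Let $\Theta$ and $\mathcal{F}$ be semimetric spaces. Let $\mathbb{M}_n(\theta, F)$ be
stochastic processes indexed by $\theta \in \Theta$ and $F \in \mathcal{F}$. Let $\mathbb{M}(\theta, F)$ be a deterministic
function, and let $(\theta_0, F_0)$ be a fixed point in the interior of $\Theta \times \mathcal{F}$.
Assume that for every $\theta$ in a neighborhood of $\theta_0$,

$$\mathbb{M}(\theta, F_0) - \mathbb{M}(\theta_0, F_0) \lesssim -d^2(\theta, \theta_0), \tag{5.2}$$

where $d(\cdot,\cdot)$ is the semimetric for $\Theta$. Let $\hat\theta_n$ be a point of maximum of
$\mathbb{M}_n(\theta, \hat F_n)$, where $\hat F_n$ is random. For each $\epsilon > 0$, suppose that the following
hold:

(a) There exists a sequence $\mathcal{F}_{n,\epsilon}$, $n = 1, 2, \ldots$, of metric subspaces of $\mathcal{F}$,
each containing $F_0$ in its interior.

(b) For all sufficiently small $\delta > 0$ (say $\delta < \delta_0$, where $\delta_0$ does not depend
on $\epsilon$), and for all sufficiently large $n$,

$$E^* \sup_{d(\theta,\theta_0) < \delta,\; F \in \mathcal{F}_{n,\epsilon}} \bigl|(\mathbb{M}_n(\theta, F) - \mathbb{M}(\theta, F_0)) - (\mathbb{M}_n(\theta_0, F) - \mathbb{M}(\theta_0, F_0))\bigr| \le C_\epsilon \frac{\phi_n(\delta)}{\sqrt{n}} \tag{5.3}$$

for a constant $C_\epsilon > 0$ and functions $\phi_n$ (not depending on $\epsilon$) such that
$\delta \mapsto \phi_n(\delta)/\delta^\alpha$ is decreasing in $\delta$ for some constant $\alpha < 2$ not depending on
$n$.

(c) $P(\hat F_n \notin \mathcal{F}_{n,\epsilon}) < \epsilon$ for $n$ sufficiently large.

If $r_n^2 \phi_n(r_n^{-1}) \lesssim \sqrt{n}$ for every $n$ and $\hat\theta_n \to_p \theta_0$, then $r_n d(\hat\theta_n, \theta_0) = O_p(1)$.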

Lemma 5.1. The process $Q_n(t)$ defined in the proof of Theorem 2.3 converges
in distribution in the space $B_{\mathrm{loc}}(\mathbb{R})$ to the Gaussian process $Q_0(t) =
aW(t) - b_0 t^2$, whose distribution is a tight Borel measure concentrated on
$C_{\max}(\mathbb{R})$. Here $a$ and $b_0$ are defined in Theorem 2.1. Furthermore, the sequence
$\{\hat t_n\}$ of maximizers of $\{Q_n(t)\}$ is $O_p(1)$ [and hence converges to
$\arg\max_t Q_0(t)$ by Theorem 5.1].
Proof. We apply the general approach outlined on page 288 of [21].
Define

$$\mathbb{M}_n(d) = \mathbb{P}_n\bigl[(Y - \hat f(d^0))\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr],$$

$$\mathbb{M}(d) = P\bigl[(Y - f(d^0))\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr].$$

Now, $\tilde d_n = \arg\max_{d \in \mathbb{R}} \mathbb{M}_n(d)$ and $d^0 = \arg\max_{d \in \mathbb{R}} \mathbb{M}(d)$ and, in fact, $d^0$
is the unique maximizer of $\mathbb{M}$ under the stipulated conditions. The last
assertion needs proof, which will be supplied later. We establish the consistency
of $\tilde d_n$ for $d^0$ and then find the rate of convergence $r_n$ of $\tilde d_n$, in
other words, that $r_n$ for which $r_n(\tilde d_n - d^0)$ is $O_p(1)$. To establish the consistency
of $\tilde d_n$ for $d^0$, we apply Corollary 3.2.3(i) of [21]. We first show that
$\sup_{d \in \mathbb{R}} |\mathbb{M}_n(d) - \mathbb{M}(d)| \to_p 0$. We can write

$$\begin{aligned}
\sup_{d \in \mathbb{R}} |\mathbb{M}_n(d) - \mathbb{M}(d)|
&\le \sup_{d \in \mathbb{R}} \bigl|(\mathbb{P}_n - P)\bigl[(Y - f(d^0))\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr]\bigr| \\
&\quad + \sup_{d \in \mathbb{R}} \bigl|\mathbb{P}_n\bigl[(f(d^0) - \hat f(d^0))\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr]\bigr|.
\end{aligned}$$

The class of functions $\{(Y - f(d^0))(1(X \le d) - 1(X \le d^0)) : d \in \mathbb{R}\}$ is VC with
a square integrable envelope [since $E(Y^2) < \infty$] and consequently Glivenko-Cantelli
in probability. Thus the first term converges to zero in probability.
The second term is easily seen to be bounded by $2|\hat f(d^0) - f(d^0)|$, which
converges to zero almost surely. It follows that $\sup_{d \in \mathbb{R}} |\mathbb{M}_n(d) - \mathbb{M}(d)| =
o_p(1)$. It remains to show that $\mathbb{M}(d^0) > \sup_{d \notin G} \mathbb{M}(d)$ for every open interval
$G$ that contains $d^0$. Since $d^0$ is the unique maximizer of the continuous (in
fact, differentiable) function $\mathbb{M}(d)$ and $\mathbb{M}(d^0) = 0$, it suffices to show that
$\lim_{d \to -\infty} \mathbb{M}(d) < 0$ and $\lim_{d \to \infty} \mathbb{M}(d) < 0$. This is indeed the case, and will
be demonstrated at the end of the proof. Thus, all conditions of Corollary
3.2.3 are satisfied, and hence $\tilde d_n$ converges in probability to $d^0$.

Next we apply Theorem 5.2 to find the rate of convergence $r_n$ of $\tilde d_n$. Given
$\epsilon > 0$, let $\mathcal{F}_{n,\epsilon} = [f(d^0) - M_\epsilon/\sqrt{n},\, f(d^0) + M_\epsilon/\sqrt{n}]$, where $M_\epsilon$ is chosen in
such a way that $\sqrt{n}\,|\hat f(d^0) - f(d^0)| \le M_\epsilon$, for sufficiently large $n$, with probability
at least $1 - \epsilon$. Since $\hat f(d^0) = (\hat\beta_l^0 + \hat\beta_u^0)/2$ is $\sqrt{n}$-consistent for $f(d^0)$, this
can indeed be arranged. Then, setting $\hat F_n = \hat f(d^0)$, we have $P(\hat F_n \notin \mathcal{F}_{n,\epsilon}) < \epsilon$
for all sufficiently large $n$. We let $d$ play the role of $\theta$, with $d^0 = \theta_0$, and
define

$$\mathbb{M}_n(d, F) = \mathbb{P}_n\bigl[(Y - F)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr],$$

$$\mathbb{M}(d, F) = P\bigl[(Y - F)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr].$$

Then $\tilde d_n$ maximizes $\mathbb{M}_n(d, \hat F_n) \equiv \mathbb{M}_n(d)$ and $d^0$ maximizes $\mathbb{M}(d, F_0)$, where
$F_0 = f(d^0)$. Consequently,

$$\mathbb{M}(d, F_0) - \mathbb{M}(d^0, F_0) = \mathbb{M}(d) - \mathbb{M}(d^0) \le -C(d - d^0)^2$$

(for some positive constant $C$) for all $d$ in a neighborhood of $d^0$ (say $d \in
[d^0 - \delta_0, d^0 + \delta_0]$), on using the continuity of $\mathbb{M}''(d)$ in a neighborhood of
$d^0$ and the fact that $\mathbb{M}''(d^0) < 0$ (which follows from arguments at the end
of this proof). Thus (5.2) is satisfied. We will next show that (5.3) is also
satisfied in our case, with $\phi_n(\delta) \equiv \sqrt{\delta}$, for all $\delta < \delta_0$. Solving $r_n^2 \phi_n(r_n^{-1}) \lesssim \sqrt{n}$
yields $r_n = n^{1/3}$, and we conclude that $n^{1/3}(\tilde d_n - d^0) = O_p(1)$.
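
The arithmetic behind the cube-root rate can be checked mechanically. The sketch below assumes a modulus of the power form phi_n(delta) = delta**gamma, so that r**2 * phi_n(1/r) = sqrt(n) becomes r**(2 - gamma) = n**(1/2); the function name `rate_exponent` is hypothetical.

```python
from fractions import Fraction

def rate_exponent(gamma):
    """For phi_n(delta) = delta**gamma with gamma < 2, return kappa such
    that r_n = n**kappa solves r_n**2 * phi_n(1/r_n) = sqrt(n)."""
    gamma = Fraction(gamma)
    if gamma >= 2:
        raise ValueError("gamma must be smaller than 2")
    # r**(2 - gamma) = n**(1/2)  =>  r = n**(1 / (2 * (2 - gamma)))
    return Fraction(1, 2) / (2 - gamma)

if __name__ == "__main__":
    print(rate_exponent(Fraction(1, 2)))  # 1/3, the cube-root rate
    print(rate_exponent(1))               # 1/2, the usual parametric rate
```

With gamma = 1/2, as in the proof, the exponent is 1/3; a modulus that is linear in delta would instead give the familiar square-root rate.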
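To show (5.3), we need to find functions $\phi_n(\delta)$ such that

$$E^* \sup_{|d - d^0| < \delta,\; F \in \mathcal{F}_{n,\epsilon}} \sqrt{n}\,|\mathbb{M}_n(d, F) - \mathbb{M}(d, F_0)|$$

is bounded by $\phi_n(\delta)$. Writing $\mathbb{G}_n \equiv \sqrt{n}(\mathbb{P}_n - P)$, we find that the left-hand
side of the above display is bounded by $A_n + B_n$, where

$$A_n = E^* \sup_{|d - d^0| < \delta,\; F \in \mathcal{F}_{n,\epsilon}} \bigl|\mathbb{G}_n\bigl[(Y - F)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr]\bigr|$$

and

$$B_n = E^* \sup_{|d - d^0| < \delta,\; F \in \mathcal{F}_{n,\epsilon}} \sqrt{n}\,\bigl|P\bigl[(F - F_0)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr]\bigr|.$$

First consider the term $A_n$. For sufficiently large $n$,

$$A_n \le E^* \sup_{|d - d^0| < \delta,\; F \in [F_0 - 1, F_0 + 1]} \bigl|\mathbb{G}_n\bigl[(Y - F)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr]\bigr|.$$

Denote by $\mathcal{M}_\delta$ the class of functions $\{(Y - F)(1(X \le d) - 1(X \le d^0)) : |d -
d^0| < \delta, F \in [F_0 - 1, F_0 + 1]\}$. An envelope function for this class is given by
$M_\delta = (|Y| + F_0 + 2)\,1(X \in [d^0 - \delta, d^0 + \delta])$. From [21], page 291, using their
notation,

$$E_P^* \|\mathbb{G}_n\|_{\mathcal{M}_\delta} \lesssim J(1, \mathcal{M}_\delta)\,(P M_\delta^2)^{1/2},$$

where $M_\delta$ is an envelope function for $\mathcal{M}_\delta$ and $J(1, \mathcal{M}_\delta)$ is the uniform
entropy integral (considered below). By straightforward computation, there
exists $\delta_0 > 0$ such that for all $\delta < \delta_0$, we have $E(M_\delta^2) \lesssim \delta$, for a constant not
depending on $\delta$ (but possibly on $\delta_0$). Also, as will be shown below, $J(1, \mathcal{M}_\delta)$
is bounded for all sufficiently small $\delta$. Hence, $A_n \lesssim \sqrt{\delta}$. Next, note that

$$\begin{aligned}
B_n &= \sup_{|d - d^0| < \delta,\; F \in \mathcal{F}_{n,\epsilon}} \sqrt{n}\,\bigl|P\bigl[(F - F_0)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr]\bigr| \\
&\le M_\epsilon \sup_{|d - d^0| < \delta} |F_X(d) - F_X(d^0)| \lesssim M_\epsilon \delta,
\end{aligned}$$

using condition (A3) in the last step. Hence $A_n + B_n \lesssim \sqrt{\delta} + \delta \lesssim \sqrt{\delta}$, since
$\delta$ can be taken less than 1. Thus the choice $\phi_n(\delta) = \sqrt{\delta}$ does indeed work.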
Now we check the boundedness of

$$J(1, \mathcal{M}_\delta) = \sup_Q \int_0^1 \sqrt{1 + \log N(\eta \|M_\delta\|_{Q,2}, \mathcal{M}_\delta, L_2(Q))}\; d\eta$$

for small $\delta$, as claimed above. Take any $\eta > 0$. Construct a grid of points on
$[F_0 - 1, F_0 + 1]$ such that two successive points on the grid are at distance
less than $\eta$ apart. This can be done using fewer than $3/\eta$ points. Now, take
a function in $\mathcal{M}_\delta$. This looks like $(Y - F)(1(X \le d) - 1(X \le d^0))$ for some
$F \in [F_0 - 1, F_0 + 1]$ and some $d$ with $|d - d^0| < \delta$. Find the closest point to
$F$ on this grid; call this $F_c$. Note that

$$\bigl|(Y - F)\bigl(1(X \le d) - 1(X \le d^0)\bigr) - (Y - F_c)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr|
\le \eta\, 1[X \in [d^0 - \delta, d^0 + \delta]] \le \eta M_\delta,$$

whence

$$\bigl\|(Y - F)\bigl(1(X \le d) - 1(X \le d^0)\bigr) - (Y - F_c)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr\|_{Q,2}$$

is bounded by $\eta \|M_\delta\|_{Q,2}$. Now for any fixed point $F_{\mathrm{grid}}$ on the grid, $\mathcal{M}_{\delta, F_{\mathrm{grid}}} \equiv
\{(Y - F_{\mathrm{grid}})(1(X \le d) - 1(X \le d^0)) : d \in [d^0 - \delta, d^0 + \delta]\}$ is a VC-class with
VC-dimension bounded by a constant not depending on $\delta$ or $F_{\mathrm{grid}}$. Also, $M_\delta$
is an envelope for $\mathcal{M}_{\delta, F_{\mathrm{grid}}}$. It follows from bounds on covering numbers for
VC-classes that $N(\eta \|M_\delta\|_{Q,2}, \mathcal{M}_{\delta, F_{\mathrm{grid}}}, L_2(Q)) \lesssim \eta^{-V_1}$ for some $V_1 > 0$ that
does not depend on $Q$, $F_{\mathrm{grid}}$ or $\delta$. Since the number of grid points is of order
$1/\eta$, using the bound on the above display we have

$$N(2\eta \|M_\delta\|_{Q,2}, \mathcal{M}_\delta, L_2(Q)) \lesssim \eta^{-(V_1 + 1)}.$$

Using this upper bound on the covering number, we obtain a finite upper
bound on $J(1, \mathcal{M}_\delta)$ for all $\delta < \delta_0$, via direct computation. This completes
the proof that $\hat t_n = n^{1/3}(\tilde d_n - d^0) = O_p(1)$.

Recalling notation from the proof of Theorem 2.3, we can write

$$Q_n(t) = \xi_n(d^0 + t n^{-1/3}) = R_n(t) + r_{n,1}(t) + r_{n,2}(t),$$

where $R_n(t) = n^{2/3}\,\mathbb{P}_n[g(\cdot, d^0 + t n^{-1/3})]$ with

$$g((X, Y), d) = \Bigl(Y - \frac{\beta_l^0 + \beta_u^0}{2}\Bigr)\bigl[1(X \le d) - 1(X \le d^0)\bigr],$$

$$r_{n,1}(t) = n^{1/6}\bigl(f(d^0) - \hat f(d^0)\bigr)\sqrt{n}\,(\mathbb{P}_n - P)\bigl[1(X \le d^0 + t n^{-1/3}) - 1(X \le d^0)\bigr]$$

and

$$r_{n,2}(t) = n^{2/3}\bigl(f(d^0) - \hat f(d^0)\bigr)\,P\bigl[1(X \le d^0 + t n^{-1/3}) - 1(X \le d^0)\bigr].$$

Here, $r_{n,1}(t) \to_p 0$ uniformly on every compact set of the form $[-K, K]$ by
applying Donsker's theorem to the empirical process

$$\bigl\{\sqrt{n}\,(\mathbb{P}_n - P)\bigl(1(X \le d^0 + s) - 1(X \le d^0)\bigr) : s \in (-\infty, \infty)\bigr\}$$

along with $n^{1/6}(\hat f(d^0) - f(d^0)) = o_p(1)$. The term $r_{n,2}(t) \to_p 0$ uniformly on
every $[-K, K]$ since $n^{1/3}(\hat f(d^0) - f(d^0)) = o_p(1)$ and

$$n^{1/3} \sup_{t \in [-K, K]} \bigl|P\bigl(1(X \le d^0 + t n^{-1/3}) - 1(X \le d^0)\bigr)\bigr| = O(1).$$

Hence, the limiting distribution of $Q_n(t)$
will be the same as the limiting distribution of $R_n(t)$. We show that $R_n \to_d Q_0$,
where $Q_0$ is the Gaussian process defined in Theorem 2.3. Write

$$R_n(t) = n^{2/3}(\mathbb{P}_n - P)[g(\cdot, d^0 + t n^{-1/3})] + n^{2/3} P[g(\cdot, d^0 + t n^{-1/3})] = I_n(t) + J_n(t).$$

In terms of the empirical process $\mathbb{G}_n$, we have $I_n(t) = \mathbb{G}_n(f_{n,t})$ where

$$f_{n,t}(x, y) = n^{1/6}(y - f(d^0))\bigl(1(x \le d^0 + t n^{-1/3}) - 1(x \le d^0)\bigr).$$

We will use Theorem 2.11.22 from [21] to show that on each compact set
$[-K, K]$, $\mathbb{G}_n f_{n,t}$ converges as a process in $l^\infty[-K, K]$ to the tight Gaussian
process $aW(t)$, where $a^2 = \sigma^2(d^0) p_X(d^0)$. Also, $J_n(t)$ converges on
every $[-K, K]$ uniformly to the deterministic function $-b_0 t^2$, with $b_0 =
|f'(d^0)| p_X(d^0)/2 > 0$. Hence $Q_n(t) \to_d Q_0(t) = aW(t) - b_0 t^2$ in $B_{\mathrm{loc}}(\mathbb{R})$, as
required.

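To complete the proof, we need to show that $I_n$ and $J_n$ have the limits
claimed above. As far as $I_n$ is concerned, provided we can verify the other
conditions of Theorem 2.11.22 from [21], the covariance kernel $H(s, t)$ of the
limit of $\mathbb{G}_n f_{n,t}$ is given by the limit of $P(f_{n,s} f_{n,t}) - P f_{n,s} P f_{n,t}$ as $n \to \infty$.
We first compute $P(f_{n,s} f_{n,t})$. This vanishes if $s$ and $t$ are of opposite signs.
For $s, t > 0$,

$$\begin{aligned}
P f_{n,s} f_{n,t} &= E\bigl[n^{1/3}(Y - f(d^0))^2\, 1\{X \in (d^0, d^0 + (s \wedge t) n^{-1/3}]\}\bigr] \\
&= \int_{d^0}^{d^0 + (s \wedge t) n^{-1/3}} n^{1/3}\, E\bigl[(f(X) + \varepsilon - f(d^0))^2 \mid X = x\bigr]\, p_X(x)\, dx \\
&= n^{1/3} \int_{d^0}^{d^0 + (s \wedge t) n^{-1/3}} \bigl(\sigma^2(x) + (f(x) - f(d^0))^2\bigr)\, p_X(x)\, dx \\
&\to \sigma^2(d^0)\, p_X(d^0)\, (s \wedge t) = a^2 (s \wedge t).
\end{aligned}$$

Also, it is easy to see that $P f_{n,s}$ and $P f_{n,t}$ converge to 0. Thus, when $s, t > 0$,

$$P(f_{n,s} f_{n,t}) - P f_{n,s} P f_{n,t} \to a^2 (s \wedge t) \equiv H(s, t).$$

Similarly, it can be checked that for $s, t < 0$, $H(s, t) = a^2(-s \wedge -t)$. Thus
$H(s, t)$ is the covariance kernel of the Gaussian process $aW(t)$.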
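Next we need to check

$$\sup_Q \int_0^{\delta_n} \sqrt{\log N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q))}\; d\epsilon \to 0, \tag{5.4}$$

for every $\delta_n \to 0$, where

$$\mathcal{F}_n = \bigl\{n^{1/6}(y - f(d^0))\bigl[1(x \le d^0 + t n^{-1/3}) - 1(x \le d^0)\bigr] : t \in [-K, K]\bigr\}$$

and

$$F_n(x, y) = n^{1/6}\,|y - f(d^0)|\; 1(x \in [d^0 - K n^{-1/3}, d^0 + K n^{-1/3}])$$

is an envelope for $\mathcal{F}_n$. From [21], page 141,

$$N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q)) \le K\, V(\mathcal{F}_n)\, (16e)^{V(\mathcal{F}_n)} \Bigl(\frac{1}{\epsilon}\Bigr)^{2(V(\mathcal{F}_n) - 1)}$$

for a universal constant $K$ and $0 < \epsilon < 1$, where $V(\mathcal{F}_n)$ is the VC-dimension
of $\mathcal{F}_n$. Since $V(\mathcal{F}_n)$ is uniformly bounded, we see that the above inequality
implies

$$N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q)) \lesssim \Bigl(\frac{1}{\epsilon}\Bigr)^{s}$$

where $s = \sup_n 2(V(\mathcal{F}_n) - 1) < \infty$, so (5.4) follows from

$$\int_0^{\delta_n} \sqrt{-\log \epsilon}\; d\epsilon \to 0$$

as $\delta_n \to 0$. We also need to check the conditions (2.11.21) in [21]:

$$P^* F_n^2 = O(1), \qquad P^* F_n^2\, 1\{F_n > \eta \sqrt{n}\} \to 0 \quad \forall \eta > 0,$$

and

$$\sup_{|s - t| < \delta_n} P(f_{n,s} - f_{n,t})^2 \to 0 \quad \forall \delta_n \to 0.$$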

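With $F_n$ as defined above, an easy computation shows that

$$P^* F_n^2 = n^{1/3} \int_{d^0 - K n^{-1/3}}^{d^0 + K n^{-1/3}} \bigl(\sigma^2(x) + (f(x) - f(d^0))^2\bigr)\, p_X(x)\, dx = O(1).$$

Denote the set $[d^0 - K n^{-1/3}, d^0 + K n^{-1/3}]$ by $S_n$. Then

$$\begin{aligned}
P^*\bigl(F_n^2\, 1\{F_n > \eta \sqrt{n}\}\bigr)
&= E\bigl[n^{1/3} |Y - f(d^0)|^2\, 1\{X \in S_n\}\, 1\{|Y - f(d^0)|\, 1\{X \in S_n\} > \eta n^{1/3}\}\bigr] \\
&\le E\bigl[n^{1/3} |Y - f(d^0)|^2\, 1\{X \in S_n\}\, 1\{|\varepsilon| > \eta n^{1/3}/2\}\bigr] \\
&\le E\bigl[2 n^{1/3} \bigl(\varepsilon^2 + (f(X) - f(d^0))^2\bigr)\, 1\{X \in S_n\}\, 1\{|\varepsilon| > \eta n^{1/3}/2\}\bigr],
\end{aligned} \tag{5.5}$$

eventually, since for all sufficiently large $n$

$$\{|Y - f(d^0)|\, 1\{X \in S_n\} > \eta n^{1/3}\} \subset \{|\varepsilon| > \eta n^{1/3}/2\}.$$

Now, the right-hand side of (5.5) can be written as $T_1 + T_2$, where

$$T_1 = 2 n^{1/3} E\bigl[\varepsilon^2\, 1\{|\varepsilon| > \eta n^{1/3}/2\}\, 1\{X \in S_n\}\bigr]$$

and

$$T_2 = 2 n^{1/3} E\bigl[(f(X) - f(d^0))^2\, 1\{X \in S_n\}\, 1\{|\varepsilon| > \eta n^{1/3}/2\}\bigr].$$

We will show that $T_1 = o(1)$. We have

$$T_1 = 2 n^{1/3} \int_{S_n} E\bigl[\varepsilon^2\, 1\{|\varepsilon| > \eta n^{1/3}/2\} \mid X = x\bigr]\, p_X(x)\, dx.$$

By (A5), for any $\xi > 0$,

$$\sup_{x \in S_n} E\bigl[\varepsilon^2\, 1\{|\varepsilon| > \eta n^{1/3}/2\} \mid X = x\bigr] < \xi$$

for $n$ sufficiently large. Since $n^{1/3} \int_{S_n} p_X(x)\, dx$ is eventually bounded by a constant
multiple of $K p_X(d^0)$, it follows that $T_1$ is eventually smaller than a constant multiple
of $\xi K p_X(d^0)$. We conclude that $T_1 = o(1)$. Next, note that (A5) implies that
$\sup_{x \in S_n} E[1\{|\varepsilon| > \eta n^{1/3}/2\} \mid X = x] \to 0$ as $n \to \infty$, so $T_2 = o(1)$ by an argument
similar to that above. Finally,

$$\sup_{|s - t| < \delta_n} P(f_{n,s} - f_{n,t})^2 \to 0$$

as $\delta_n \to 0$ can be checked via similar computations.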

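We next deal with $J_n$. For convenience we sketch the uniformity of the
convergence of $J_n(t)$ to the claimed limit on $0 \le t \le K$. We have

$$\begin{aligned}
J_n(t) &= n^{2/3} E\bigl[(Y - f(d^0))\bigl(1(X \le d^0 + t n^{-1/3}) - 1(X \le d^0)\bigr)\bigr] \\
&= n^{2/3} E\bigl[(f(X) - f(d^0))\, 1(X \in (d^0, d^0 + t n^{-1/3}])\bigr] \\
&= n^{2/3} \int_{d^0}^{d^0 + t n^{-1/3}} (f(x) - f(d^0))\, p_X(x)\, dx \\
&= n^{1/3} \int_0^t \bigl(f(d^0 + u n^{-1/3}) - f(d^0)\bigr)\, p_X(d^0 + u n^{-1/3})\, du \\
&= \int_0^t u\, \frac{f(d^0 + u n^{-1/3}) - f(d^0)}{u n^{-1/3}}\, p_X(d^0 + u n^{-1/3})\, du \\
&\to \int_0^t u\, f'(d^0)\, p_X(d^0)\, du \quad (\text{uniformly on } 0 \le t \le K) \\
&= \tfrac{1}{2} f'(d^0)\, p_X(d^0)\, t^2.
\end{aligned}$$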
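It only remains to verify that (i) $d^0$ is the unique maximizer of $\mathbb{M}(d)$, (ii)
$\mathbb{M}(-\infty) < 0$, $\mathbb{M}(\infty) < 0$ and (iii) $f'(d^0) p_X(d^0) < 0$ [so the process $aW(t) +
(f'(d^0) p_X(d^0)/2) t^2$ is indeed in $C_{\max}(\mathbb{R})$]. To show (i), recall that

$$\mathbb{M}(d) = E[g((X, Y), d)] = E\Bigl[\Bigl(Y - \frac{\beta_l^0 + \beta_u^0}{2}\Bigr)\bigl(1(X \le d) - 1(X \le d^0)\bigr)\Bigr].$$

Let $\xi(d) = E[Y - \beta_l^0 1(X \le d) - \beta_u^0 1(X > d)]^2$. By condition (A1), $d^0$ is the
unique minimizer of $\xi(d)$. Consequently, $d^0$ is also the unique maximizer of
the function $\xi(d^0) - \xi(d)$. Straightforward algebra shows that

$$\xi(d^0) - \xi(d) = 2(\beta_l^0 - \beta_u^0)\, \mathbb{M}(d)$$

and since $\beta_l^0 - \beta_u^0 > 0$, it follows that $d^0$ is also the unique maximizer of
$\mathbb{M}(d)$. This shows (i). Next,

$$\begin{aligned}
\mathbb{M}(d) &= E\bigl[(f(X) - f(d^0))\bigl(1(X \le d) - 1(X \le d^0)\bigr)\bigr] \\
&= \int_{-\infty}^{\infty} (f(x) - f(d^0))\bigl(1(x \le d) - 1(x \le d^0)\bigr)\, p_X(x)\, dx \\
&= \int_{-\infty}^{d} (f(x) - f(d^0))\, p_X(x)\, dx - \int_{-\infty}^{d^0} (f(x) - f(d^0))\, p_X(x)\, dx,
\end{aligned}$$

so that

$$\mathbb{M}(-\infty) = \lim_{d \to -\infty} \mathbb{M}(d) = -\int_{-\infty}^{d^0} (f(x) - f(d^0))\, p_X(x)\, dx < 0$$

if and only if $\int_{-\infty}^{d^0} f(x) p_X(x)\, dx > f(d^0) F_X(d^0)$ if and only if $\beta_l^0 \equiv \int_{-\infty}^{d^0} f(x)
p_X(x)\, dx / F_X(d^0) > (\beta_l^0 + \beta_u^0)/2$, and this is indeed the case, since $\beta_l^0 > \beta_u^0$.
We can prove that $\mathbb{M}(\infty) < 0$ in a similar way, so (ii) holds. Also, $\mathbb{M}'(d) =
(f(d) - f(d^0))\, p_X(d)$, so $\mathbb{M}'(d^0) = 0$. Finally,

$$\mathbb{M}''(d) = f'(d)\, p_X(d) + (f(d) - f(d^0))\, p_X'(d),$$

so $\mathbb{M}''(d^0) = f'(d^0) p_X(d^0) \le 0$, since $d^0$ is the maximizer. This implies (iii),
since by our assumptions $f'(d^0) p_X(d^0) \ne 0$. □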

Proof of Theorem 2.1. Let $\Theta$ denote the set of all possible values
of $(\beta_l, \beta_u, d)$ and let $\theta$ denote a generic vector in $\Theta$. Define the criterion
function $\mathbb{M}(\theta) = P m_\theta$, where

$$m_\theta(x, y) = (y - \beta_l)^2\, 1(x \le d) + (y - \beta_u)^2\, 1(x > d).$$

The vector $\theta_0 \equiv (\beta_l^0, \beta_u^0, d^0)$ minimizes $\mathbb{M}(\theta)$, while $\hat\theta_n \equiv (\hat\beta_l, \hat\beta_u, \hat d_n)$ minimizes
$\mathbb{M}_n(\theta) = \mathbb{P}_n m_\theta$. Since $\theta_0$ uniquely minimizes $\mathbb{M}(\theta)$ under condition
(A1), using the twice continuous differentiability of $\mathbb{M}$ at $\theta_0$, we have

$$\mathbb{M}(\theta) - \mathbb{M}(\theta_0) \ge C\, d^2(\theta, \theta_0)$$

in a neighborhood of $\theta_0$ (for some $C > 0$), where $d(\cdot,\cdot)$ is the $l_\infty$ metric on
$\mathbb{R}^3$. Thus, there exists $\delta_0 > 0$ sufficiently small, such that for all $(\beta_l, \beta_u, d)$
with $|\beta_l - \beta_l^0| < \delta_0$, $|\beta_u - \beta_u^0| < \delta_0$ and $|d - d^0| < \delta_0$, the above display holds.

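For all $\delta < \delta_0$ we will find a bound on $E_P^* \|\mathbb{G}_n\|_{\mathcal{M}_\delta}$, where $\mathcal{M}_\delta \equiv \{m_\theta -
m_{\theta_0} : d(\theta, \theta_0) < \delta\}$ and $\mathbb{G}_n \equiv \sqrt{n}(\mathbb{P}_n - P)$. From [21], page 298,

$$E_P^* \|\mathbb{G}_n\|_{\mathcal{M}_\delta} \lesssim J(1, \mathcal{M}_\delta)\, (P M_\delta^2)^{1/2},$$

where $M_\delta$ is an envelope function for the class $\mathcal{M}_\delta$. Straightforward algebra
shows that

$$\begin{aligned}
(m_\theta - m_{\theta_0})(X, Y)
&= 2(Y - f(d^0))(\beta_u^0 - \beta_l^0)\{1(X \le d) - 1(X \le d^0)\} \\
&\quad + (\beta_l^0 - \beta_l)(2Y - \beta_l^0 - \beta_l)\, 1(X \le d) \\
&\quad + (\beta_u^0 - \beta_u)(2Y - \beta_u^0 - \beta_u)\, 1(X > d).
\end{aligned}$$

The class of functions

$$\mathcal{M}_{1,\delta} = \bigl\{2(Y - f(d^0))(\beta_u^0 - \beta_l^0)\{1(X \le d) - 1(X \le d^0)\} : d \in [d^0 - \delta, d^0 + \delta]\bigr\}$$

is easily seen to be VC, with VC-dimension bounded by a constant not
depending on $\delta$; furthermore, $M_{1,\delta} = 2|(Y - f(d^0))(\beta_u^0 - \beta_l^0)|\, 1(X \in [d^0 -
\delta, d^0 + \delta])$ is an envelope function for this class. It follows that

$$N(\epsilon \|M_{1,\delta}\|_{P,2}, \mathcal{M}_{1,\delta}, L_2(P)) \lesssim \epsilon^{-V_1},$$

for some $V_1 > 0$ that does not depend on $\delta$. Next, consider the class of
functions

$$\mathcal{M}_{2,\delta} = \bigl\{(\beta_l^0 - \beta_l)(2Y - \beta_l^0 - \beta_l)\, 1(X \le d) : d \in [d^0 - \delta, d^0 + \delta],\ \beta_l \in [\beta_l^0 - \delta, \beta_l^0 + \delta]\bigr\}.$$

Fix a grid of points $\{\beta_{l,c}\}$ in $[\beta_l^0 - \delta, \beta_l^0 + \delta]$ such that successive points on
this grid are at a distance less than $\tilde\epsilon$ apart, where $\tilde\epsilon = \epsilon\delta/2$. The cardinality
of this grid is certainly less than $3\delta/\tilde\epsilon$. For a fixed $\beta_{l,c}$ in this grid, the class of
functions $\mathcal{M}_{2,\delta,c} \equiv \{(\beta_l^0 - \beta_{l,c})(2Y - \beta_l^0 - \beta_{l,c})\, 1(X \le d) : d \in [d^0 - \delta, d^0 + \delta]\}$
is certainly VC with VC-dimension bounded by a constant that does not
depend on $\delta$ or the point $\beta_{l,c}$. Also, note that $M_{2,\delta} \equiv \delta(2|Y| + C)$, where $C$
is a sufficiently large constant not depending on $\delta$, is an envelope function
for the class $\mathcal{M}_{2,\delta}$, and hence also an envelope function for the restricted
class with $\beta_{l,c}$ held fixed. It follows that for some universal positive constant
$V_2 > 0$ and any $\eta > 0$,

$$N(\eta \|M_{2,\delta}\|_{P,2}, \mathcal{M}_{2,\delta,c}, L_2(P)) \lesssim \eta^{-V_2}.$$

Now, $\|M_{2,\delta}\|_{P,2} = \delta \|G\|_{P,2}$, where $G = 2|Y| + C$. Thus,

$$N(\tilde\epsilon \|G\|_{P,2}, \mathcal{M}_{2,\delta,c}, L_2(P)) \lesssim \Bigl(\frac{\delta}{\tilde\epsilon}\Bigr)^{V_2}.$$

Next, consider a function $g(X, Y) = (\beta_l^0 - \beta_l)(2Y - \beta_l^0 - \beta_l)\, 1(X \le d)$ in $\mathcal{M}_{2,\delta}$.
Find a $\beta_{l,c}$ that is within $\tilde\epsilon$ distance of $\beta_l$. There are of order $(\delta/\tilde\epsilon)^{V_2}$ balls
of radius $\tilde\epsilon \|G\|_{P,2}$ that cover the class $\mathcal{M}_{2,\delta,c}$, so the function $g_c(X, Y) \equiv
(\beta_l^0 - \beta_{l,c})(2Y - \beta_l^0 - \beta_{l,c})\, 1(X \le d)$ must be at distance less than $\tilde\epsilon \|G\|_{P,2}$
from the center of one of these balls, say $B$. Also, it is easily checked that
$\|g - g_c\|_{P,2} < \tilde\epsilon \|G\|_{P,2}$. Hence $g$ must be at distance less than $2\tilde\epsilon \|G\|_{P,2}$ from
the center of $B$. It then readily follows that

$$N(2\tilde\epsilon \|G\|_{P,2}, \mathcal{M}_{2,\delta}, L_2(P)) \lesssim \Bigl(\frac{\delta}{\tilde\epsilon}\Bigr)^{V_2 + 1},$$

on using the fact that the cardinality of the grid $\{\beta_{l,c}\}$ is of order $\delta/\tilde\epsilon$.
Substituting $\epsilon\delta/2$ for $\tilde\epsilon$ in the above display, we get

$$N(\epsilon \|M_{2,\delta}\|_{P,2}, \mathcal{M}_{2,\delta}, L_2(P)) \lesssim \Bigl(\frac{1}{\epsilon}\Bigr)^{V_2 + 1}.$$

Finally, with

$$\mathcal{M}_{3,\delta} = \bigl\{(\beta_u^0 - \beta_u)(2Y - \beta_u^0 - \beta_u)\, 1(X > d) : d \in [d^0 - \delta, d^0 + \delta],\ \beta_u \in [\beta_u^0 - \delta, \beta_u^0 + \delta]\bigr\}$$

and $M_{3,\delta} \equiv \delta(2|Y| + C')$ for some sufficiently large constant $C'$ not depending
on $\delta$, we similarly argue that

$$N(\epsilon \|M_{3,\delta}\|_{P,2}, \mathcal{M}_{3,\delta}, L_2(P)) \lesssim \Bigl(\frac{1}{\epsilon}\Bigr)^{V_3 + 1}$$

for some positive constant $V_3$ not depending on $\delta$. The class $\mathcal{M}_\delta \subset \mathcal{M}_{1,\delta} +
\mathcal{M}_{2,\delta} + \mathcal{M}_{3,\delta} \equiv \tilde{\mathcal{M}}_\delta$. Set $M_\delta = M_{1,\delta} + M_{2,\delta} + M_{3,\delta}$. Now, it is not difficult to
see that

$$N(3\epsilon \|M_\delta\|_{P,2}, \tilde{\mathcal{M}}_\delta, L_2(P)) \lesssim \Bigl(\frac{1}{\epsilon}\Bigr)^{V_1 + V_2 + V_3 + 2}.$$

This also holds for any probability measure $Q$ such that $0 < E_Q(Y^2) < \infty$,
with the constant being independent of $Q$ or $\delta$. Since $\mathcal{M}_\delta \subset \tilde{\mathcal{M}}_\delta$, it follows
that

$$N(3\epsilon \|M_\delta\|_{Q,2}, \mathcal{M}_\delta, L_2(Q)) \lesssim \Bigl(\frac{1}{\epsilon}\Bigr)^{V_1 + V_2 + V_3 + 2}.$$

Thus, with $\mathcal{Q}$ denoting the set of all such measures $Q$,

$$J(1, \mathcal{M}_\delta) \equiv \sup_{Q \in \mathcal{Q}} \int_0^1 \sqrt{1 + \log N(\epsilon \|M_\delta\|_{Q,2}, \mathcal{M}_\delta, L_2(Q))}\; d\epsilon < \infty$$

for all sufficiently small $\delta$. Next,

$$P M_\delta^2 \lesssim P M_{1,\delta}^2 + P M_{2,\delta}^2 + P M_{3,\delta}^2 \lesssim \delta + \delta^2 \lesssim \delta$$

since we can assume $\delta < 1$. Therefore $E_P^* \|\mathbb{G}_n\|_{\mathcal{M}_\delta} \lesssim \sqrt{\delta}$, and $\phi_n(\delta)$ in Theorem
3.2.5 of [21] can be taken as $\sqrt{\delta}$. Solving $r_n^2 \phi_n(1/r_n) \le \sqrt{n}$ yields
$r_n \le n^{1/3}$, and we conclude that

$$n^{1/3}\bigl(\hat\beta_l - \beta_l^0,\ \hat\beta_u - \beta_u^0,\ \hat d_n - d^0\bigr) = O_p(1).$$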

Having established the rate of convergence, we now determine the asymptotic
distribution. It is easy to see that

$$n^{1/3}\bigl(\hat\beta_l - \beta_l^0,\ \hat\beta_u - \beta_u^0,\ \hat d_n - d^0\bigr) = \arg\min_h V_n(h),$$

where

$$V_n(h) = n^{2/3}(\mathbb{P}_n - P)\bigl[m_{\theta_0 + h n^{-1/3}} - m_{\theta_0}\bigr] + n^{2/3} P\bigl[m_{\theta_0 + h n^{-1/3}} - m_{\theta_0}\bigr] \tag{5.6}$$

for $h = (h_1, h_2, h_3) \in \mathbb{R}^3$. The second term above converges to $h^T V h/2$, uniformly
on every $[-K, K]^3$ ($K > 0$), where $V$ is the Hessian of the function
$\theta \mapsto P m_\theta$ at the point $\theta_0$, on using the twice continuous differentiability of
the function at $\theta_0$ and the fact that $\theta_0$ minimizes this function. Note that
$V$ is a positive definite matrix. Calculating the Hessian matrix gives

$$V = \begin{pmatrix}
2F_X(d^0) & 0 & (\beta_l^0 - \beta_u^0)\, p_X(d^0) \\
0 & 2(1 - F_X(d^0)) & (\beta_l^0 - \beta_u^0)\, p_X(d^0) \\
(\beta_l^0 - \beta_u^0)\, p_X(d^0) & (\beta_l^0 - \beta_u^0)\, p_X(d^0) & 2\,|(\beta_l^0 - \beta_u^0) f'(d^0)|\, p_X(d^0)
\end{pmatrix}.$$

We next deal with distributional convergence of the first term in (5.6), which
can be written as $\sqrt{n}(\mathbb{P}_n - P) f_{n,h}$, where $f_{n,h} = f_{n,h,1} + f_{n,h,2} + f_{n,h,3}$ and

$$f_{n,h,1}(x, y) = n^{1/6}\, 2(\beta_u^0 - \beta_l^0)(y - f(d^0))\bigl(1(x \le d^0 + h_3 n^{-1/3}) - 1(x \le d^0)\bigr),$$

$$f_{n,h,2}(x, y) = -n^{-1/6} h_1 (2y - 2\beta_l^0 - h_1 n^{-1/3})\, 1(x \le d^0 + h_3 n^{-1/3}),$$

$$f_{n,h,3}(x, y) = -n^{-1/6} h_2 (2y - 2\beta_u^0 - h_2 n^{-1/3})\, 1(x > d^0 + h_3 n^{-1/3}).$$

A natural envelope function $F_n$ for $\mathcal{F}_n \equiv \{f_{n,h} : h \in [-K, K]^3\}$ is given by

$$\begin{aligned}
F_n(x, y) &= 2 n^{1/6}\, |(\beta_u^0 - \beta_l^0)(y - f(d^0))|\; 1(x \in [d^0 - K n^{-1/3}, d^0 + K n^{-1/3}]) \\
&\quad + K n^{-1/6}(2|y - \beta_l^0| + 1) + K n^{-1/6}(2|y - \beta_u^0| + 1).
\end{aligned}$$

The limiting distribution of $\sqrt{n}(\mathbb{P}_n - P) f_{n,h}$ is directly obtained by appealing
to Theorem 2.11.22 of [21]. On each compact set of the form $[-K, K]^3$,
the process $\sqrt{n}(\mathbb{P}_n - P) f_{n,h}$ converges in distribution to $\tilde a W(h_3)$, where
$\tilde a = 2|\beta_l^0 - \beta_u^0|\,(\sigma^2(d^0) p_X(d^0))^{1/2}$. This follows on noting that

$$\lim_{n \to \infty} \bigl(P f_{n,s} f_{n,h} - P f_{n,s} P f_{n,h}\bigr) = \tilde a^2 (|s_3| \wedge |h_3|)\, 1(s_3 h_3 > 0),$$

by direct computation and verification of conditions (2.11.21) preceding the
statement of Theorem 2.11.22; we omit the details as they are similar to
those in the proof of Lemma 5.1. The verification of the entropy-integral
condition, that is,

$$\sup_Q \int_0^{\delta_n} \sqrt{\log N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q))}\; d\epsilon \to 0$$

as $\delta_n \to 0$, uses $N(\epsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q)) \lesssim \epsilon^{-V}$ for some $V > 0$ not depending
on $Q$; the argument is similar to the one we used earlier with $J(1, \mathcal{M}_\delta)$.

It follows that the process $V_n(h)$ converges in distribution in the space
$B_{\mathrm{loc}}(\mathbb{R}^3)$ to the process $\mathbb{W}(h_1, h_2, h_3) \equiv \tilde a W(h_3) + h^T V h/2$. The limiting
distribution is concentrated on $C_{\min}(\mathbb{R}^3)$ [defined analogously to $C_{\max}(\mathbb{R}^3)$],
which follows on noting that the covariance kernel of the Gaussian process
$\mathbb{W}$ has the rescaling property (2.4) of [11] and that $V$ is positive definite;
furthermore, $\mathbb{W}(s) - \mathbb{W}(h)$ has nonzero variance for $s \ne h$, whence Lemma
2.6 of [11] forces a unique minimizer. Invoking Theorem 5.1 (to be precise,
a version of the theorem with max replaced by min), we conclude that

$$\Bigl(\arg\min_h V_n(h),\ \min_h V_n(h)\Bigr) \to_d \Bigl(\arg\min_h \mathbb{W}(h),\ \min_h \mathbb{W}(h)\Bigr). \tag{5.7}$$

But note that

$$\min_h \mathbb{W}(h) = \min_{h_3} \Bigl\{\tilde a W(h_3) + \min_{h_1, h_2} h^T V h/2\Bigr\}$$

and we can find $\arg\min_{h_1, h_2} h^T V h/2$ explicitly. After some routine calculus,
we find that the limiting distribution of the first component in (5.7) can be
expressed in the form stated in the theorem. This completes the proof. □
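
The "routine calculus" of profiling out the first two coordinates of a positive definite quadratic form can be sketched numerically. This is an illustration only: the function name `profile_quadratic` and the matrix entries in the demonstration are hypothetical, not values from the paper. For fixed h_3, minimizing h'Vh/2 over (h_1, h_2) gives a minimizer that is linear in h_3 and a minimum that is a multiple of h_3 squared (half the Schur complement of the upper-left 2x2 block).

```python
import numpy as np

def profile_quadratic(V):
    """For a symmetric positive definite 3x3 matrix V, minimize h'Vh/2 over
    (h1, h2) with h3 held fixed.

    Returns (c, b): the minimizer is (h1, h2) = c * h3 and the minimum
    value is b * h3**2.
    """
    V = np.asarray(V, dtype=float)
    A = V[:2, :2]            # block for (h1, h2)
    v = V[:2, 2]             # cross terms with h3
    sol = np.linalg.solve(A, v)
    c = -sol                                  # from the first-order condition A u + v h3 = 0
    b = 0.5 * (V[2, 2] - v @ sol)             # half the Schur complement
    return c, b

if __name__ == "__main__":
    # Hypothetical positive definite matrix with the same zero pattern
    # as a Hessian whose (1, 2) entry vanishes.
    V = np.array([[2.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 3.0]])
    print(profile_quadratic(V))
```

After this reduction the three-dimensional minimization collapses to a one-dimensional problem in h_3 of the form "Brownian term plus a positive multiple of h_3 squared", which is the structure exploited in the last step of the proof.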

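Proof of Theorem 2.2. Inspecting the second component of (5.7),
we find

$$n^{-1/3}\,\mathrm{RSS}_0(\beta_l^0, \beta_u^0, d^0) = -\min_h V_n(h) \to_d -\min_h \mathbb{W}(h)$$

and this simplifies to the limit stated in the theorem. To show that $n^{-1/3}\,\mathrm{RSS}_1(d^0)$
converges to the same limit, it suffices to show that the difference $D_n \equiv
n^{-1/3}\,\mathrm{RSS}_0(\beta_l^0, \beta_u^0, d^0) - n^{-1/3}\,\mathrm{RSS}_1(d^0)$ is asymptotically negligible. Some
algebra gives that $D_n = I_n + J_n$, where

$$I_n = n^{-1/3} \sum_{i=1}^n (2Y_i - \beta_l^0 - \hat\beta_l^0)(\hat\beta_l^0 - \beta_l^0)\, 1(X_i \le d^0)$$

and

$$J_n = n^{-1/3} \sum_{i=1}^n (2Y_i - \beta_u^0 - \hat\beta_u^0)(\hat\beta_u^0 - \beta_u^0)\, 1(X_i > d^0).$$

Then

$$\begin{aligned}
I_n &= \sqrt{n}(\hat\beta_l^0 - \beta_l^0)\, n^{1/6}\, \mathbb{P}_n\bigl[(2Y - \beta_l^0 - \hat\beta_l^0)\, 1(X \le d^0)\bigr] \\
&= \sqrt{n}(\hat\beta_l^0 - \beta_l^0)\, n^{1/6}\, (\mathbb{P}_n - P)\bigl[(2Y - \beta_l^0 - \hat\beta_l^0)\, 1(X \le d^0)\bigr] \\
&\quad + \sqrt{n}(\hat\beta_l^0 - \beta_l^0)\, n^{1/6}\, P\bigl[(2Y - \beta_l^0 - \hat\beta_l^0)\, 1(X \le d^0)\bigr] \\
&= I_{n,1} + I_{n,2}.
\end{aligned}$$

Since $\sqrt{n}(\hat\beta_l^0 - \beta_l^0) = O_p(1)$ and

$$\begin{aligned}
n^{1/6}(\mathbb{P}_n - P)\bigl[(2Y - \beta_l^0 - \hat\beta_l^0)\, 1(X \le d^0)\bigr]
&= n^{1/6}(\mathbb{P}_n - P)\bigl[(2Y - \beta_l^0)\, 1(X \le d^0)\bigr] \\
&\quad - \hat\beta_l^0\, n^{1/6}(\mathbb{P}_n - P)\bigl(1(X \le d^0)\bigr)
\end{aligned}$$

is clearly $o_p(1)$ by the central limit theorem and the consistency of $\hat\beta_l^0$,
we have that $I_{n,1} = o_p(1)$. To show $I_{n,2} = o_p(1)$, it suffices to show that
$n^{1/6} P[(2Y - \beta_l^0 - \hat\beta_l^0)\, 1(X \le d^0)] \to_p 0$. But this can be written as

$$n^{1/6} P\bigl[2(Y - \beta_l^0)\, 1(X \le d^0)\bigr] + n^{1/6} P\bigl[(\beta_l^0 - \hat\beta_l^0)\, 1(X \le d^0)\bigr].$$

The first term vanishes, from the normal equations characterizing $(\beta_l^0, \beta_u^0, d^0)$,
and the second term is $n^{1/6} O_p(n^{-1/2}) \to_p 0$. We have shown that $I_n = o_p(1)$,
and $J_n = o_p(1)$ can be shown in the same way. This completes the proof. □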